Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability

doi:10.21203/rs.3.rs-7518209/v1


2025 · doi:10.21203/rs.3.rs-7518209/v1

preprint OA: closed CC-BY-4.0



Rogério Almeida Gouvêa, Pierre-Paul de Breuck, Tatiane Pretto, and 2 more

This is a preprint (Research Square); it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7518209/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 15 Jan, 2026 in npj Computational Materials. Version 1 posted 10

Abstract

This study introduces MatterVial, an innovative hybrid framework for feature-based machine learning in materials science.
MatterVial expands the feature space by integrating latent representations from a diverse suite of pretrained graph neural network (GNN) models, including structure-based (MEGNet), composition-based (ROOST), and equivariant (ORB) graph networks, with computationally efficient, GNN-approximated descriptors and novel features from symbolic regression. Our approach combines the chemical transparency of traditional feature-based models with the predictive power of deep learning architectures. When augmenting the feature-based model MODNet on Matbench tasks, this method yields significant error reductions and elevates its performance to be competitive with, and in several cases superior to, state-of-the-art end-to-end GNNs, with error reductions exceeding 40% on multiple tasks. An integrated interpretability module, employing surrogate models and symbolic regression, decodes the latent GNN-derived descriptors into explicit, physically meaningful formulas. This unified framework advances materials informatics by providing a high-performance, transparent tool that aligns with the principles of explainable AI, paving the way for more targeted and autonomous materials discovery.

Subject areas: Physical sciences/Engineering; Physical sciences/Materials science; Physical sciences/Mathematics and computing

Keywords: Feature-based machine learning; MODNet; graph neural networks; materials informatics; interpretability

Introduction

Machine learning has revolutionized materials science, accelerating material discovery and property optimization across various domains [1–3]. The two prominent approaches in this field are feature-based and graph-neural-network (GNN) models, each with distinct advantages and limitations [4, 5]. Feature-based models rely on predefined descriptors such as elemental properties, geometric features, and electronic structure information.
They are highly interpretable and effective with small datasets, offering insights into structure-property relationships [6, 7]. These models adapt well to custom tasks in experimental settings, such as nanocrystal research [8], catalysis [9], and organic photovoltaics [10]. In contrast, GNN models represent materials as graphs, capturing structural information through message passing and learning deep representations from simple atomic descriptors. This often results in more accurate predictions for complex materials, but requires greater computational resources and more training data [11, 12]. GNNs are particularly effective in the large-scale screening of materials and for constructing interatomic potentials owing to their efficient computation and local information aggregation [13]; however, they lack interpretability.

Boosting the accuracy of feature-based models to make them competitive on larger datasets usually implies employing neural network models and relying on extensive descriptor suites, such as MatMiner [7], to produce meaningful features. This process is particularly time-consuming for sophisticated descriptors like the Orbital Field Matrix (OFM) [14] and the Smooth Overlap of Atomic Positions (SOAP) [15]. A novel strategy to boost these feature-based models involves leveraging the rich latent-space representations learned by GNN models pretrained on vast datasets. Even though neural networks are universal function approximators, easing their burden through well-aligned feature transformations can improve generalization, reduce training time, and stabilize convergence [16, 17].

In this work, we address these challenges by proposing a hybrid approach that combines traditional chemically intuitive descriptors with latent features obtained from a diverse set of pretrained models.
We incorporate features from both structure-based (MEGNet, coGN) [18, 19] and composition-based (ROOST) [20] GNNs, as well as from ORB [21], a powerful equivariant Machine Learning Interatomic Potential (MLIP). To avoid the featurization bottleneck of traditional descriptors, we also leverage GNNs to generate fast, latent-space approximations of MatMiner (ℓ-MM) and Orbital Field Matrix (ℓ-OFM) features. Finally, we augment this feature set with new descriptors derived via symbolic regression. This multifaceted strategy aims to create a more robust, accurate, and versatile featurizer that capitalizes on the distinct strengths of each approach and remains useful across a wider range of dataset sizes.

To simplify the generation of all these features, we developed a package named MatterVial, standing for MATerials feaTuRe Extraction Via Interpretable Artificial Learning. Besides producing all latent-space features from the GNN models, it aids in obtaining the interpretable chemical descriptors that correlate with these high-level features. This is achieved through techniques such as SHapley Additive exPlanations (SHAP) analysis of surrogate models and symbolic regression via the Sure Independence Screening and Sparsifying Operator (SISSO), which yields an approximate formula built from the most important features. Our results demonstrate an overall improvement on all analyzed datasets compared with the baseline MatMiner featurizer. In addition, the approach surpassed the performance of the individual GNN models in several cases, indicating that the combination of traditional and latent-space features leads to more robust generalization.

This work is situated within a recent methodological trend that repurposes GNNs not as end-to-end predictors, but as powerful and data-efficient feature generators for a variety of downstream tasks [22–26].
Our approach bridges feature-based and graph-based methods, leveraging their strengths to develop more versatile and task-agnostic machine learning models in materials science. By enhancing the accuracy, efficiency, and interpretability of property prediction, this framework facilitates the integration of both experimental and simulated data. Moreover, it aligns with the growing demand for explainable AI [27, 28], which is essential for the advancement of self-driving laboratories in materials discovery and optimization [29].

Results and discussion

We evaluate our approach using the full MatBench v0.1 benchmark [30] with MODNet [4], which is the state-of-the-art feature-based model in materials science [12]. We adopt the same MatMiner featurization as that used for MODNet on MatBench in the original publication [4, 31]. These features can be complemented by three categories of MatterVial features, as illustrated in Fig. 1:

I. Latent-space features from descriptor-oriented GNNs: Conventional material descriptors are transformed into latent representations using an autoencoder trained on Materials Project (MP) data. These descriptors include the widely used MatMiner features (ℓ-MM) and the features from the Orbital Field Matrix featurizer (ℓ-OFM). A GNN is then trained to replicate these latent features directly from the input structures. This method achieves a computational efficiency similar to that of GNNs and still preserves interpretability via decoding.

II. Latent-space features from task-oriented GNNs: These features are extracted directly from the intermediate layers of pretrained GNN models that have been developed for various tasks. Specifically, we incorporate MEGNet models from the Materials Virtual Lab (MVL) that were pretrained for the prediction of elastic constants, band gap, and formation energy, as well as for metal-insulator classification. We also consider composition-based ROOST models for the band gap and formation energy.
In addition, we include the internal layers of ORB-v3, a state-of-the-art equivariant MLIP trained to reproduce energies and forces. This group capitalizes on the strengths of GNN architectures in capturing complex structural representations, aiming to enhance predictive performance on larger datasets.

III. Symbolically derived feature combinations: Here, we use the MatMiner features as a basis to generate new compound features. Through symbolic regression with SISSO, we identify several combinations of pairs of features (rung one) that exhibit enhanced correlations with the target properties of interest in materials science. These derived formulas are then incorporated as new features.

Since the features obtained from task-oriented GNNs are high-level and not directly interpretable as traditional descriptors, we develop a method to decompose them into interpretable descriptors, which is integrated in the Interpreter module of MatterVial. In addition, features from descriptor-oriented GNNs can be decoded into their interpretable counterparts. Likewise, the formulas of the third group of features, obtained via symbolic regression, can be retrieved by name. Comprehensive implementation details for each category and for the Interpreter module are available in the Methods section and in the Supplementary Information.

MatBench validation of MatterVial features

Table 1 presents the performance of MODNet using MatMiner augmented with MatterVial descriptors (MODNet@MM + MV) and MODNet using only MatterVial descriptors (MODNet@MV) relative to the baseline model using only MatMiner features (MODNet@MM) on the 13 MatBench tasks. The results show that blending latent-space representations from both task-oriented and descriptor-oriented GNNs with symbolically derived features consistently reduces prediction errors across this diverse array of property prediction tasks.
Our approach significantly improves the performance on smaller datasets, where feature-based models have traditionally outperformed GNNs. Specifically, our models set new performance records for four tasks previously led by MODNet@MM and now achieve leading performance in metallicity classification from experimental data. Only the glass-forming ability task showed no substantial improvement. We highlight that for smaller composition-based datasets, MatMiner featurization is sufficiently fast to make MODNet@MM + MV computationally effective. For larger datasets, for which traditional featurization is very time consuming, our MODNet@MV models significantly narrow the gap between feature-based and graph-based models, even outperforming state-of-the-art (SOTA) models in predicting properties such as elastic constants, band gap, metallicity, and formation energy. This success demonstrates that our approach effectively addresses the common shortcomings of both feature- and graph-based models. Note, however, that some of the larger MatBench tasks can no longer be considered truly independent test sets for models exposed to vast amounts of similar ab initio data during pretraining.

Table 1. Performance comparison of three MODNet variants against the best multi-purpose MatBench model on each task in the MatBench v0.1 benchmark. Metrics are reported as mean absolute error (MAE) for regression and area under the receiver operating characteristic curve (AUROC) for classification. MODNet@MM uses only MatMiner features; MODNet@MM + MV augments these features with MatterVial descriptors; and MODNet@MV uses only MatterVial features, essentially substituting MatMiner features with ℓ-MM. For each task, the MatterVial feature group that yields the best result is shown. Scores in bold identify the overall best model per task, and shaded tasks are those in which MODNet was already the best model.
MatBench task | n | MODNet@MM (baseline) | MODNet@MM + MV (% error reduction*) | MODNet@MV (% error reduction*) | MatBench record (model) | Best MatterVial groups**
Steels yield strength (MPa) [32] | 312 | 87.76 | 85.12 (3.0%) | 120.95 (-37.8%) | MODNet | ROOST
E_exfol (meV/atom) [33] | 636 | 33.19 | 29.19 (12.1%) | 28.86 (13.0%) | MODNet | ORB, MVL, ℓ-OFM, ℓ-MM
argmax(PhDOS) (cm^-1) [34] | 1,265 | 34.27 | 30.08 (12.2%) | 30.58 (10.8%) | 28.76 (MEGNet) | MVL, ORB, ℓ-OFM, ROOST, SISSO
Exp. band gap (eV) [35] | 4,604 | 0.333 | 0.290 (12.9%) | 0.351 (-5.5%) | MODNet | ROOST, SISSO
Refractive index [36, 37] | 4,764 | 0.271 | 0.235 (13.3%) | 0.234 (13.7%) | MODNet | ORB, ℓ-OFM, MVL, ℓ-MM
Exp. metallicity [35] | 4,921 | 0.916 | 0.976 (71.4%) | 0.898 (-59.3%) | 0.921 (AMMExpress) | ROOST
Glass-forming ability [38, 39] | 5,680 | 0.936 (0.960)† | 0.937 (1.6%) | 0.904 (-50.0%) | MODNet | ROOST
Logarithmic G_vrh (log10 GPa) [40] | 10,987 | 0.073 | 0.032 (55.5%) | 0.033 (54.8%) | 0.067 (coGN) | MVL, ORB, ℓ-MM, ROOST, SISSO
Logarithmic K_vrh (log10 GPa) [40] | 10,987 | 0.056 | 0.027 (49.6%) | 0.028 (50.1%) | 0.049 (coNGN) | MVL, ORB, ℓ-MM, ℓ-OFM, ROOST, SISSO
Perovskite ΔH_form (eV/unit cell) [41] | 18,928 | 0.0908 | 0.0386 (57.5%) | 0.0389 (57.3%) | 0.0269 (coGN) | ORB, MVL, ℓ-OFM, ℓ-MM, ROOST, SISSO
Band gap (eV) [37] | 106,113 | 0.2199 | 0.137 (37.6%) | 0.137 (37.8%) | 0.156 (coGN) | MVL, ORB, ROOST, SISSO
Metallicity [37] | 106,113 | 0.904 | 0.978 (77.1%) | 0.976 (75.0%) | 0.9520 (CGCNN) | ORB, MVL, ℓ-OFM, ROOST
E_f (eV/atom) [37] | 132,752 | 0.0448 | 0.0147 (67.2%) | 0.0138 (69.2%) | 0.0170 (coGN) | MVL, ORB, ℓ-OFM, ℓ-MM

* % error reduction \(= \frac{\mathrm{MAE}_{\text{baseline}} - \mathrm{MAE}_{\text{model}}}{\mathrm{MAE}_{\text{baseline}}} \times 100\%\) (regression) or \(\frac{\left(1 - \mathrm{AUROC}_{\text{baseline}}\right) - \left(1 - \mathrm{AUROC}_{\text{model}}\right)}{1 - \mathrm{AUROC}_{\text{baseline}}} \times 100\%\) (classification).

** Ordered by importance. MVL, ORB, and ROOST refer to the task-oriented GNN features: respectively, those from the MVL MEGNet models for structures, the MLIP Orb-v3, and the pretrained ROOST models for compositions. ℓ-MM and ℓ-OFM refer to the descriptor-oriented GNN features; ℓ-MM, when included, substitutes the MatMiner features for faster generation. SISSO refers to the group of features derived from MM features via symbolic regression.

† As we were unable to replicate the reported 0.960 AUROC for glass formability using MODNet, we present our best MODNet@MM result as the baseline instead. Despite the lower score, MODNet continues to outperform other models in MatBench for this task.

An analysis of the feature contributions in Table 1 reveals that task-oriented latent features are the primary drivers of performance gains. ROOST was included to enhance performance on composition-based tasks, yet it also reliably improved results across a wide range of tasks that contain structural information. This performance may be attributed to the attention mechanism, which captures unique patterns during activation and material pooling. For structure-based tasks, MVL-derived features have a significant positive impact. They boost predictions even when the prediction targets differ from those used in the original models, such as in predicting the perovskite heat of formation and the refractive index. The ORB features, derived from an equivariant MLIP, proved particularly impactful, frequently appearing as top contributors.
This is chemically intuitive, as the model's training on energies and forces provides a rich, physically meaningful latent space that is useful for transfer learning. This aligns with very recent findings by Kim et al. [26], who also employed ORB features with MODNet for structure-based regression tasks. Our approach achieves enhanced performance by incorporating all Orb-v3 layers and combining these features with diverse descriptor groups within our framework.

The descriptor-oriented and symbolically derived features also provided consistent complementary improvements. The ℓ-OFM features improved performance across most tasks, validating that our GNN-based approximation is an efficient and effective method for incorporating the descriptive power of computationally expensive descriptors like the Orbital Field Matrix. The ℓ-MM features, designed as a GNN-based shortcut for MatMiner features, lead to improved or similar performance on many tasks. Comparing against the models that used the full MatMiner features (MODNet@MM + MV), we argue that the reconstruction loss was sufficiently low and that, in some cases, the encoder effectively refined the representation via regularization, improving the metrics. Crucially, these latent-space representations remain decodable, preserving much of the interpretability that is a hallmark of feature-based models. Finally, the SISSO-derived features, while less universally impactful, still boosted performance in roughly half of the benchmarks. Given that we utilized only first-rung symbolic regression, we conjecture that there is clear potential for further gains with higher-level, more complex formulas.

Ultimately, these results show that our approach simultaneously accelerates featurization, improves model performance, and provides valuable chemical insights. This combination of benefits repositions feature-based models as strong and practical alternatives to end-to-end GNNs for property prediction.
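The percentage error reduction defined in the footnote of Table 1 is simple enough to check by hand. The following is a minimal Python sketch of that definition; the function names are illustrative and are not part of MatterVial or MODNet.

```python
def mae_error_reduction(mae_baseline: float, mae_model: float) -> float:
    """Percent error reduction for regression tasks (lower MAE is better)."""
    return (mae_baseline - mae_model) / mae_baseline * 100.0


def auroc_error_reduction(auroc_baseline: float, auroc_model: float) -> float:
    """Percent error reduction for classification, treating 1 - AUROC as the error."""
    err_baseline = 1.0 - auroc_baseline
    err_model = 1.0 - auroc_model
    return (err_baseline - err_model) / err_baseline * 100.0


if __name__ == "__main__":
    # Inputs are taken from Table 1 (MODNet@MM baseline vs. MODNet@MM + MV).
    print(round(mae_error_reduction(0.333, 0.290), 1))    # experimental band gap
    print(round(auroc_error_reduction(0.916, 0.976), 1))  # experimental metallicity
```

With the Table 1 inputs shown, the two calls reproduce the 12.9% and 71.4% entries reported for the experimental band gap and experimental metallicity tasks.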
Synergy of MatterVial features and adjacent GNN model

Having demonstrated the performance gains of our method, we now turn to the individual contributions of the MatterVial features. We examine the synergistic effects of each MatterVial feature group using the perovskite heat of formation task as an example. Table 2 illustrates a step-by-step performance evaluation for this task, revealing how the integration of different MatterVial feature groups leads to cumulative improvements. Starting from our baseline, the MODNet@MM model delivers an MAE of 0.0888 eV/unit cell. This performance serves as a reference point against which the benefits of the additional features can be measured.

The first modification involves introducing the descriptor-oriented GNN features, ℓ-OFM and ℓ-MM, which are designed to be computationally faster approximations of their full counterparts. When MatMiner features are entirely replaced by their latent representation (MODNet@ℓ-MM), the MAE is 0.1052 eV/unit cell. While higher than our MODNet@MM baseline (0.0888 eV/unit cell), this still significantly outperforms AutoMatMiner (0.2005 eV/unit cell), demonstrating that ℓ-MM is a viable, faster featurization alternative. Augmenting MatMiner with ℓ-OFM (MODNet@MM+ℓ-OFM) reduces the MAE to 0.0794 eV/unit cell. This is lower than the baseline, although still higher than that obtained using the original, computationally intensive OFM features (MODNet@MM + OFM, 0.0751 eV/unit cell). Combining both ℓ-MM and ℓ-OFM (MODNet@ℓ-MM+ℓ-OFM) yields an MAE of 0.0973 eV/unit cell. These results highlight that our proxy GNN featurizers offer a compelling trade-off, capturing essential chemical information with a substantial speed-up in featurization.

Building on this foundation, the incorporation of task-oriented GNN features from the MVL pretrained models further boosts performance in the MODNet@ℓ-MM+ℓ-OFM+MVL model, lowering the MAE to 0.0673 eV/unit cell.
Clearly, the MVL descriptors capture additional structural and physicochemical details that the MM and OFM features do not, thereby enhancing the ability of the model to predict the heat of formation (more details on the MVL descriptions and the effect of different layers are given in the Supplementary Information, section S6). Next, the addition of symbolically derived feature combinations via SISSO produces a modest refinement, reducing the MAE to 0.0653 eV/unit cell. Although the improvement is small, it underscores the notion that simple algebraic combinations of conventional descriptors can reveal non-linear relationships, complement the latent-space features, and thereby enhance prediction accuracy.

Further refinement is achieved by incorporating composition-based ROOST features. At first glance, one might not expect an improvement over the MEGNet MVL models, since they incorporate structural information alongside composition. However, we believe that the attention-based mechanism present in ROOST captures additional meaningful information that complements the other feature groups, achieving an MAE of 0.0639 eV/unit cell. Furthermore, at this point, using the standard MatMiner features instead of their latent representation (ℓ-MM) yields a nearly equivalent performance (MAE of 0.0637 eV/unit cell). These results confirm that the rapidly generated encoded representations can effectively replace the full MatMiner features when used in tandem with other descriptors. However, eliminating MatMiner features entirely (neither MM nor ℓ-MM) causes a significant decrease in accuracy, with MAEs of 0.0707 eV/unit cell for MODNet@MVL + ROOST and 0.0716 eV/unit cell for MODNet@MVL, indicating that the MatMiner features are valuable and not simply redundant with these GNN descriptors. In fact, a synergistic effect among all MatterVial feature groups is observed in this dataset.

ORB features stand apart from other featurizers like MVL and ROOST.
While MVL and ROOST were trained on smaller datasets, specifically MP and OQMD [42] (about 1.5 million structures combined), the ORB-v3 featurizer was trained on a significantly larger dataset. This dataset, which combines MP, Alexandria [43], and OMat [44], comprises approximately 120 million calculated structures, a number at least two orders of magnitude larger than either of those datasets. Extracting features from this model in MatterVial for use in MODNet significantly reduces the mean absolute error in the task, but a slight improvement is still seen when the other MatterVial features are included. We conjecture that larger reductions might still be achievable by training more task- and descriptor-oriented models on these larger datasets.

Despite the significant reduction, feature-based approaches using pretrained models with MatterVial or HackNIP still fall short of the results obtained purely with GNNs such as MEGNet and coGN trained on the perovskites dataset. Based on this observation, we incorporate into MatterVial the possibility of training adjacent GNN models on the fly and extracting their features with the AdjacentGNNFeaturizer class. We achieve 0.0343 eV/unit cell using the MEGNet adjacent model features. The MEGNet benchmarked MAE is substantially lower than what we achieved using the default configuration of the model, even with the same elemental embeddings provided by the authors. This discrepancy is possibly due to differences in hyperparameters, the inclusion of additional features, and the longer training times employed for the benchmark [45]. Finally, we employed the SOTA coGN model as an adjacent model for feature extraction, and we obtained results comparable to the values reported in MatBench for this model. Incorporating coGN features in our MODNet model reduced the MAE to 0.0313 eV/unit cell, which is much closer to the 0.0269 eV/unit cell record.
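The step-by-step comparison above amounts to concatenating feature groups column-wise and re-evaluating a downstream model for each combination. The sketch below illustrates that bookkeeping on synthetic data, using a closed-form ridge regression as a stand-in for MODNet. The group names are placeholders, the arrays are random numbers rather than real MatterVial features, and none of the function names come from the MatterVial API.

```python
import numpy as np


def combine_groups(groups: dict[str, np.ndarray], selected: list[str]) -> np.ndarray:
    """Column-wise concatenation of the selected feature groups."""
    return np.hstack([groups[name] for name in selected])


def ridge_holdout_mae(X: np.ndarray, y: np.ndarray, n_train: int, alpha: float = 1e-3) -> float:
    """Fit ridge regression on the first n_train rows and return the MAE on the rest."""
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    # Standardize with training statistics only, then append a bias column.
    mean, std = X_train.mean(axis=0), X_train.std(axis=0) + 1e-12
    A_train = np.hstack([(X_train - mean) / std, np.ones((len(X_train), 1))])
    A_test = np.hstack([(X_test - mean) / std, np.ones((len(X_test), 1))])
    # Closed-form ridge solution: (A^T A + alpha I) w = A^T y.
    gram = A_train.T @ A_train + alpha * np.eye(A_train.shape[1])
    weights = np.linalg.solve(gram, A_train.T @ y_train)
    return float(np.mean(np.abs(A_test @ weights - y_test)))


def make_synthetic_groups(n_samples: int = 400, seed: int = 0):
    """Synthetic stand-ins for two feature groups and a target that depends on both."""
    rng = np.random.default_rng(seed)
    groups = {
        "MM": rng.normal(size=(n_samples, 5)),      # placeholder for classical descriptors
        "latent": rng.normal(size=(n_samples, 8)),  # placeholder for GNN latent features
    }
    y = groups["MM"][:, 0] + 2.0 * groups["latent"][:, 0] + 0.1 * rng.normal(size=n_samples)
    return groups, y


if __name__ == "__main__":
    groups, y = make_synthetic_groups()
    for selected in (["MM"], ["latent"], ["MM", "latent"]):
        mae = ridge_holdout_mae(combine_groups(groups, selected), y, n_train=300)
        print("+".join(selected), f"holdout MAE = {mae:.3f}")
```

Because the synthetic target depends on one column from each group, the combined feature set gives a lower holdout error than either group alone. This mirrors the kind of complementarity reported for the real feature groups, without reproducing any of the paper's numbers.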
Table 2. Mean absolute errors (MAEs) for the MatBench task of the heat of formation of perovskites with different models. Reference models and MatterVial models are listed together within each group.

Model | MAE (eV/unit cell)

Descriptor-oriented
AutoMatMiner (MatBench*) | 0.2005 (±0.0085)
MODNet@ℓ-MM | 0.1052 (±0.0022)
MODNet@MM (this work) | 0.0888 (±0.0025)
MODNet@MM+ℓ-OFM | 0.0794 (±0.0016)
MODNet@MM+OFM | 0.0751 (±0.0018)
MODNet@ℓ-MM+ℓ-OFM | 0.0973 (±0.0016)

Task-oriented (MVL, ROOST)
MODNet@MVL | 0.0716 (±0.0020)
MODNet@ℓ-MM+ℓ-OFM+MVL | 0.0673 (±0.0015)
MODNet@MVL+ROOST | 0.0707 (±0.0017)
MODNet@ℓ-MM+ℓ-OFM+MVL+SISSO | 0.0653 (±0.0013)
MODNet@MM+ℓ-OFM+MVL+SISSO+ROOST | 0.0637 (±0.001)
MODNet@ℓ-MM+ℓ-OFM+MVL+SISSO+ROOST | 0.0639 (±0.0010)

Task-oriented (ORB featurizer)
MODNet@MM+ℓ-OFM+MVL+SISSO+ROOST+ORB | 0.0386 (±0.0009)
HackNIP [26] (MODNet@ORB) | 0.0397
MODNet@MV† | 0.0388 (±0.0006)

MV + Adjacent GNN model
MEGNet (MatBench*) | 0.0352 (±0.0016)
MEGNet (this work) | 0.0685 (±0.0036)
MODNet@MV+Adj(MEGNet) | 0.0343 (±0.0014)
coGN (MatBench*) | 0.0269 (±0.0008)
MODNet@MV+Adj(coGN) | 0.0313 (±0.0012)
coGN (this work) | 0.0271 (±0.0008)
MODNet@MV+Adj(coGN)+hiSISSO | 0.0288 (±0.0009)

* Data retrieved from MatBench [12] in August 2025.

† For brevity, MV = (ℓ-MM+ℓ-OFM+MVL+SISSO+ROOST+ORB), i.e., all pretrained featurizers in MatterVial.

Figure 2 graphically depicts the synergistic effects detailed in Table 2 by comparing different models. The feature importance obtained from the mean absolute SHAP values aggregated by feature group in Fig. 2(a-c) quantifies the contribution of the different groups of features and shows a clear shift in dominance as more powerful features are introduced.
However, a closer inspection reveals important nuances regarding how these feature sets interact. In the MODNet@MV(no ORB) model, there is a relatively balanced and significant contribution from all feature groups, led by MVL, ℓ-MM, and SISSO, underscoring their collective utility. This is seen more clearly in the t-SNE projections of the SHAP value vectors for each feature in the model, where these three sets of features cover most regions of the projection, with smaller contributions from ROOST and ℓ-OFM features. When ORB features are introduced (Fig. 2b and 2e), they become the dominant contributor, explaining the dramatic reduction in MAE observed in Table 2. Crucially, the ℓ-MM and SISSO features retain a significant portion of their importance, with SISSO even ranking among the highest contributors. This indicates that they capture complementary chemical information not fully encapsulated within the ORB latent space, explaining the slightly better result obtained compared to HackNIP's MODNet@ORB model [26].

This hierarchical and synergistic contribution of features directly explains the visual improvement in the data manifold shown in the t-SNE projections (Fig. 2g-i). The feature space of the MODNet@MV(no ORB) model (Fig. 2g) shows some organization. However, the introduction of ORB features (Fig. 2h) creates a significantly more structured manifold with a smoother gradient along the target property. This synergistic contribution continues in the final MODNet@MV + Adj(coGN) model. The inclusion of adjacent coGN features (Fig. 2i) results in the most well-defined feature space in the t-SNE projection, with the clearest separation between data points according to the target property. While the task-specific coGN features predictably take the lead, the pretrained ORB and MVL features remain highly influential, serving as the second- and third-most important groups, respectively (Fig. 2c and 2f).
In contrast, the contributions from ℓ-MM and SISSO are now marginal, as their predictive information has been superseded by more powerful GNN features. This layered view of contributions highlights the interpretability brought by feature-based models. In the following section, we showcase how this interpretability can be deepened using new MatterVial tools.

Interpretability of MatterVial features

We begin by analyzing the most important features of the MODNet@MV(no ORB) model to understand what factors increase its accuracy in predicting the perovskite heat of formation (ΔH_f). Unlike end-to-end GNNs, where features are deeply entangled through message passing, feature-based models have readily decoupled features, and SHAP values can be used to robustly assess the most important ones, as shown in the plot in Fig. 3. Utilizing the MatterVial Interpreter module, we can easily obtain SISSO formulas with up to five terms to approximate the GNN features of the included pretrained models. These approximations are based on interpretable descriptors from MatMiner and OFM. The plot displays the one-term formulas and their corresponding R² values, demonstrating that even with relatively simple descriptors, these approximation formulas can achieve high R² values for many meaningful GNN features.

This analysis identifies several key feature groups that drive the predictions. Features from the MVL formation energy model, for instance, correlate stability with large electronegativity gaps (promoting ionic character) and low d-electron fractions, which favor early transition metals. Features generated by SISSO highlight structural drivers, rewarding dense atomic packing, ordered coordination environments, and specific stabilizing factors like 3 Å interatomic contacts, while penalizing destabilizing electronic effects from excess d-electrons.
Compositional features from ROOST and encoded MatMiner (ℓ-MM) models capture broader trends, showing that perovskites made of heavier, chemically diverse elements tend to be less stable and illustrating the balance between destabilizing wide-band-gap elements and the stabilizing effect of species with many unfilled d-states. Finally, encoded OFM (ℓ-OFM) features provide a granular view of bonding, distinguishing between the stabilizing interactions characteristic of oxides (e.g., s²-p⁴) and weaker bonds involving pnictogens. Collectively, this demonstrates that the model learns a multi-faceted and physically grounded understanding of perovskite stability. A full breakdown of the individual features shown in the figure is provided in the Supplementary Information, section S10.

A comparative SHAP analysis of the best-performing MODNet@MV and MODNet@MV + adj(coGN) models, which incorporate richer ORB and coGN features (see SI, section S11, Figs. S4-S7), showed that while the MODNet@MV(no ORB) model primarily relies on fundamental chemical descriptors, the addition of ORB features shifts the emphasis of the model toward geometric information such as packing efficiency. The top-performing MODNet@MV + adj(coGN) model builds on this by capturing the most sophisticated features, representing a complex interplay between chemical and geometric properties. This increase in predictive power is accompanied by a decrease in direct interpretability. As the models become more complex, the ability to approximate their most important features with simple SISSO formulas diminishes (indicated by progressively lower R² values), and their correlation with classical descriptors weakens (Table S13). This progression highlights the gap between the complex features of high-performing GNNs and the limited descriptive power of interpretable descriptors, emphasizing the need for more flexible descriptors that remain compact enough for symbolic regression methods and interpretability.
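The idea of approximating a latent feature with a one-term formula and reporting its R² can be illustrated with a brute-force search over pairwise combinations of primary descriptors. The sketch below is a deliberately simplified stand-in for the screening step of symbolic regression, not the SISSO algorithm or the MatterVial Interpreter. The operator set, the function names, and the synthetic data are all assumptions made for illustration.

```python
import itertools

import numpy as np


def rung_one_candidates(X: np.ndarray, names: list[str]) -> dict[str, np.ndarray]:
    """Build candidate features from all pairs of primary features and basic operators."""
    candidates = {}
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        a, b = X[:, i], X[:, j]
        candidates[f"({names[i]} + {names[j]})"] = a + b
        candidates[f"({names[i]} - {names[j]})"] = a - b
        candidates[f"({names[i]} * {names[j]})"] = a * b
        # Division is not symmetric, so both orderings are kept.
        # This sketch assumes strictly nonzero primary features.
        candidates[f"({names[i]} / {names[j]})"] = a / b
        candidates[f"({names[j]} / {names[i]})"] = b / a
    return candidates


def rank_one_term_formulas(X, names, target, top_k=3):
    """Rank candidates by the R^2 of the linear fit target ~ c0 + c1 * candidate."""
    scores = []
    for formula, values in rung_one_candidates(X, names).items():
        # For a one-variable least-squares fit with intercept,
        # R^2 equals the squared Pearson correlation.
        r = np.corrcoef(values, target)[0, 1]
        scores.append((formula, float(r**2)))
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


def make_synthetic_latent(n_samples=500, seed=0):
    """Synthetic 'latent feature' that is, by construction, a ratio of two descriptors."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.5, 2.0, size=(n_samples, 4))
    latent = 3.0 * X[:, 0] / X[:, 2] + 1.0 + 0.01 * rng.normal(size=n_samples)
    return X, latent


if __name__ == "__main__":
    X, latent = make_synthetic_latent()
    names = ["x0", "x1", "x2", "x3"]
    for formula, r2 in rank_one_term_formulas(X, names, latent):
        print(f"{formula}: R^2 = {r2:.4f}")
```

On this synthetic target the ratio `(x0 / x2)` is recovered as the top-ranked candidate. For real GNN features no such exact formula is guaranteed to exist, which is consistent with the lower R² values the text reports for the more complex models.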
To test the utility of our GNN feature approximations, we conducted a two-stage experiment. In the first stage, we compared two types of SISSO models: a baseline using only MatMiner and OFM descriptors, and an enhanced version that added formulas approximating the GNN's most important features. For both model types, we applied a consistent methodology, using several primary feature pre-selection algorithms, including mRMR (i-SISSO) [46], random forest importances (rf-SISSO) [47], and our xgb-rfe-SISSO (SI, Sec. S8). The addition of the GNN-derived features yields a significant and consistent reduction in prediction error, as shown in Fig. 4(a). Our approach is analogous to hierarchical SISSO (hiSISSO) [48], but it uniquely feeds back approximations of learned GNN features rather than terms from a prior SISSO model. In the second stage, we extract the terms from this enhanced SISSO model and incorporate them as new "hiSISSO features" to augment the MODNet@MV + adj(coGN) model. This augmented model further reduces the error to 0.0288 eV/unit cell. The t-SNE projection of SHAP value contributions and the average feature importance of the classes in Fig. 4(b,c) confirm their effectiveness, showing that the high per-feature predictive power of the hiSISSO features complements the model. This demonstrates that explicit, interpretable formulas can improve generalization and raises the compelling question of whether GNN features could be replaced entirely if more expressive, physically grounded descriptors were available.

Conclusion

In this work, we introduced MatterVial, a unified and modular hybrid framework designed to bridge the gap between the predictive power of graph neural networks (GNNs) and the chemical transparency of traditional feature-based models in materials science.
By augmenting the state-of-the-art feature-based model MODNet with a diverse and synergistic set of descriptors, this approach elevates its performance to be competitive with, and in several cases superior to, end-to-end GNNs. To summarize our contributions:

(i) MatterVial is a novel open-source Python framework that generates a rich hybrid feature set. It integrates latent-space representations from various pretrained models, including structure-based GNNs such as MEGNet, an equivariant interatomic potential (ORB), and composition-based networks such as ROOST. The framework also uses computationally efficient GNN-approximated descriptors (ℓ-MM, ℓ-OFM) and features derived from symbolic regression.

(ii) The hybrid model demonstrates broad applicability and superior performance across the full MatBench v0.1 benchmark. It consistently reduces prediction errors across nearly all 13 tasks and establishes new state-of-the-art records for feature-based models in several categories.

(iii) A key innovation is a method that systematically decodes abstract GNN-derived features into more intuitive formulaic descriptors. This is achieved using surrogate models and symbolic regression to translate latent representations into explicit mathematical expressions based on fundamental physicochemical properties.

(iv) By incorporating features from an adjacent, task-specific GNN model, the framework enables a feature-based model to achieve predictive accuracy that is highly competitive with state-of-the-art GNNs while uniquely maintaining a modular and analyzable feature space.

(v) It was demonstrated that the interpretable formulas extracted from GNNs can be fed back into the model as new "hiSISSO features", leading to a further reduction in prediction error. This confirms that the interpretability method can capture causally relevant physical information.

In conclusion, this work repositions feature-based modeling as a premier methodology in materials informatics.
It delivers a practical solution that meets the dual demands of high accuracy and interpretability, a combination that is becoming increasingly critical in the field. While predictive accuracy is essential, interpretability allows researchers to validate that models have learned physically meaningful principles, thereby building trust and moving beyond simple prediction to genuine scientific understanding. This deeper insight accelerates materials discovery by enabling a shift from brute-force screening to more targeted, hypothesis-driven design. Ultimately, this alignment with the principles of explainable AI is a prerequisite for developing the next generation of autonomous discovery platforms, or "self-driving labs", which require models that can not only predict outcomes but also explain the underlying principles to guide subsequent experiments.

Methods

MODNet model training

The MatMiner featurizer used throughout this work is described in detail in the Supplementary Information, section S1. Because MatterVial produces a large number of features, all experiments incorporating them begin with a preselection step that uses recursive feature elimination with XGBoost [49] to reduce the pool to 800 features. Subsequently, the built-in MODNet feature selection algorithm is used to select and rank the subset of these features that will be used for training. At this point, we can determine which groups of MatterVial features are relevant for a given task ("Best MatterVial groups" in Table 1). The MODNet hyperparameters are optimized via a genetic algorithm, and the best-performing models on the validation set form deep ensembles, as described in Ref. 26, which are then evaluated on the test set to obtain the final metrics. The mean absolute error (MAE) serves as the primary evaluation metric in regression tasks, and for classification tasks, the area under the receiver operating characteristic curve (AUROC) is used.
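As a minimal sketch of the evaluation step, the snippet below computes the MAE of an ensemble prediction. It assumes that ensemble members are combined by a plain average and that their spread is summarized by a standard deviation. The exact aggregation used by MODNet is not stated in this section, so the function names and the toy numbers are illustrative assumptions.

```python
from statistics import mean, pstdev


def mean_absolute_error(y_true, y_pred):
    """MAE: average of |target - prediction| over all samples."""
    return mean([abs(t - p) for t, p in zip(y_true, y_pred)])


def ensemble_predict(member_predictions):
    """Combine per-model predictions (assumed aggregation: simple average).

    member_predictions: list of prediction lists, one list per ensemble member,
    all of the same length. Returns (per-sample mean, per-sample spread).
    """
    per_sample = list(zip(*member_predictions))
    means = [mean(values) for values in per_sample]
    spreads = [pstdev(values) for values in per_sample]
    return means, spreads


# Toy example: three hypothetical ensemble members, four test samples.
members = [
    [0.10, 0.52, 1.01, 1.48],
    [0.12, 0.49, 0.97, 1.53],
    [0.08, 0.50, 1.02, 1.49],
]
targets = [0.10, 0.50, 1.00, 1.50]

y_mean, y_spread = ensemble_predict(members)
print("ensemble MAE:", mean_absolute_error(targets, y_mean))
```

In a five-fold cross-validation setting, this MAE would be computed once per held-out fold and then averaged over the folds.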
We consistently use the five-fold cross-validation procedure described in Matbench [25] for all presented tasks. A supplementary data repository with detailed results of our work is available at https://github.com/rogeriog/MatterVial_SupportData.

MatterVial implementation

MatterVial is an open-source featurizer tool implemented in Python (available at https://github.com/rogeriog/MatterVial) to enhance material property predictions by integrating pretrained descriptor-oriented and task-oriented GNNs, as well as precomputed symbolic formulas from traditional chemically intuitive descriptors. The package offers significant flexibility and modularity, allowing the extraction of features from different layers of pretrained models and the incorporation of other GNN models as needed. The following outlines each MatterVial featurizer employed:

● ℓ-OFM featurizer: the OFM featurizer captures valence electron interactions at each atomic site by employing a weighted vector outer product of one-hot encoded valence orbitals for every atom (details in the Supplementary Information, section S2, Fig. S1). The structural representation is achieved by averaging all local OFMs. We apply the OFM featurizer to a subset of the Materials Project MP-crystals-2018.6.1 [50] dataset with 106,113 structures whose energy above the convex hull was lower than 150 meV, nicknamed MP2018-stable, followed by training an autoencoder to derive a latent space representation. The latent OFM features are subsequently used as targets to train a GNN model that generates these features directly from the initial structures.

● ℓ-MM featurizer: following a similar procedure to the ℓ-OFM featurizer, we encode features obtained from the default MatMiner featurizer of MODNet v.0.1.13 applied to the MP2018-stable dataset, resulting in 1,336 MatMiner features.
The selected compression level provides latent MatMiner features (ℓ-MM), which are then used as targets to train a GNN model that directly generates these features from the original structures. The DescriptorMEGNetFeaturizer class in the MatterVial package retrieves both the OFM-encoded and the MatMiner-encoded features. A thorough investigation of these encoded features, including the use of different compression levels and hyperparameters, was conducted, as detailed in the Supplementary Information (sections S4, S5, S7 and Fig. S3, also Tables S1–S3, S5–S8, S10–S11).

● MVL MatterVial featurizers: Utilizing the MVLFeaturizer class from the MatterVial package, we incorporate five pretrained MEGNet models provided by the Materials Virtual Lab [50]. Specifically, these are the models trained for the formation energy, Fermi energy, and elastic constants K_VRH and G_VRH on the 2019.4.1 Materials Project crystals dataset, as well as the band gap regression model trained on the 2018.6.1 Materials Project crystals dataset. The default MEGNet architecture comprises MEGNet blocks followed by an MLP with two dense layers, one with 32 neurons and the other with 16 neurons, before producing the target property (see section S3, Fig. S2, in the Supplementary Information). The modularity of the MatterVial package allows us to extract features from different layers of these pretrained models. We extract features from the MLP layers preceding the output, specifically from the 32-neuron (layer32) and 16-neuron (layer16) layers. The effect of using the different layers for prediction is examined in the Supplementary Information, section S6, Table S4. For this paper, the extracted features of both layers (160 descriptors for layer32 and 80 descriptors for layer16, across the five models) are concatenated and added to the final feature vector.
● Adjacent GNN featurizer: The AdjacentGNNFeaturizer class from the MatterVial package is employed to train a MEGNet or coGN model on the fly for each fold of the train-test split. This adjacent model captures task-specific data nuances, enhancing prediction accuracy. The default hyperparameters from MEGNet v.1.3.2 and coGN are utilized, as detailed in the Supplementary Information, section S7.2, Table S9.

● SISSO-based formula featurizer: The SISSO++ framework [51] was used to generate symbolic expressions that approximate target material properties across 15 datasets (see Supplementary Information, section S8, Table S12 for details) by transforming MatMiner features. The method begins by recursively applying a predefined set of operators (e.g., addition, subtraction, multiplication, division, sine, cosine, exponential, and logarithm) to expand the feature space, followed by sure-independence screening (SIS), which ranks the resulting candidates by their correlation with the target property, and a sparsification step that selects a compact descriptor set. For our configuration, restricted to rung one, this yields 20 paired-feature formulas. By opting for the expressions produced at the SIS step instead of the final SISSO formula, we preserve versatility and generalization when these features are integrated with MODNet neural networks. These formulas, derived for each of the 15 tasks, are compiled in the file SISSO_FORMULAS_v1.txt, which is accessed by the get_sisso_features function in MatterVial to process the given MatMiner features (either directly or decoded from ℓ-MM) and output a dataframe of evaluated expressions.

In terms of computational cost, generating the complete feature set with MatterVial is substantially more efficient than traditional MatMiner featurization.
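The rung-one expansion and screening described for the SISSO-based formula featurizer can be sketched in a few lines. This is a simplified illustration, not the SISSO++ implementation: it assumes only the four binary operators, ranks candidates by absolute Pearson correlation with the target, and omits the unary operators and the sparsification step. All function and feature names are hypothetical.

```python
import math
from itertools import combinations


def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)


def rung_one_candidates(features):
    """Apply +, -, *, / once to every pair of primary features."""
    candidates = {}
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        candidates[f"({name_a} + {name_b})"] = [x + y for x, y in zip(a, b)]
        candidates[f"({name_a} - {name_b})"] = [x - y for x, y in zip(a, b)]
        candidates[f"({name_a} * {name_b})"] = [x * y for x, y in zip(a, b)]
        if all(y != 0 for y in b):  # skip divisions by zero
            candidates[f"({name_a} / {name_b})"] = [x / y for x, y in zip(a, b)]
    return candidates


def sis(features, target, k):
    """Keep the k candidate expressions most correlated with the target."""
    candidates = rung_one_candidates(features)
    ranked = sorted(
        candidates,
        key=lambda expr: abs(pearson(candidates[expr], target)),
        reverse=True,
    )
    return ranked[:k]


# Toy data: the target is exactly the product of the two primary features.
features = {"a": [1.0, 2.0, 3.0, 4.0], "b": [1.0, 3.0, 2.0, 5.0]}
target = [1.0, 6.0, 6.0, 20.0]
print(sis(features, target, k=3))
```

Keeping the top-k screened expressions, rather than a single final formula, mirrors the choice described above of passing the SIS-level candidates on as features.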
Although the precise runtime for MatMiner is highly dataset-dependent, our observations indicate that MatterVial reduces feature generation time by a minimum of two orders of magnitude, especially when leveraging GPUs.

Retrieving interpretability via MatterVial's interpreter module

Before employing MatterVial's interpreter module, we conduct a SHAP value analysis (see Supplementary Information, section S9) on our MODNet models to assess feature importance. Using the SHAP Python library, we perform the analysis with 300 samples and 500 perturbations on 24 CPU cores in about 20 minutes, revealing the features with the greatest impact on the model predictions. To bridge the gap between high-level latent representations and interpretable chemical descriptors, MatterVial leverages surrogate XGBoost models. These models are trained to predict each latent feature identified in the previous assessment, using the MP2018-stable dataset featurized with interpretable MatMiner and OFM features. The tree-based additive structure of XGBoost ensures rapid and parallel training as well as efficient SHAP calculations. For each feature, the top 30 most influential interpretable descriptors, as determined by SHAP, are forwarded to SISSO++, which performs symbolic regression to retrieve a symbolic formula that better correlates with the latent feature. This process is illustrated in Fig. 5. The SHAP decompositions and SISSO formula for the latent-space features computed this way can be retrieved by calling the Interpreter class and invoking get_shap_values or get_formula with the feature name as generated by MatterVial. Moreover, this interpretability framework extends to adjacent GNN models. Using MatterVial's tools, including the AdjacentGNNFeaturizer, a task-specific GNN model is trained, and its latent features can be interpreted following the same pipeline, for which helper functions are provided.
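The descriptor pre-selection step of this pipeline can be sketched as follows. The snippet assumes that a matrix of SHAP values for the surrogate model is already available (rows are samples, columns are interpretable descriptors), and it ranks descriptors by their mean absolute SHAP value, a common convention that is assumed here. The function name, descriptor names, and numbers are hypothetical and do not reflect the MatterVial API.

```python
def top_k_descriptors(shap_values, names, k=30):
    """Rank descriptors by mean |SHAP| and return the k most influential names.

    shap_values: list of rows, one row per sample, one column per descriptor.
    names: descriptor names, in column order.
    """
    n_samples = len(shap_values)
    mean_abs = [
        sum(abs(row[j]) for row in shap_values) / n_samples
        for j in range(len(names))
    ]
    ranked = sorted(zip(names, mean_abs), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Toy SHAP matrix: two samples, three hypothetical descriptors.
names = ["electronegativity_gap", "d_electron_fraction", "packing_density"]
shap_values = [
    [0.50, -0.10, 0.20],
    [-0.70, 0.05, 0.30],
]

selected = top_k_descriptors(shap_values, names, k=2)
print(selected)
```

The selected descriptors would then form the primary feature space handed to the symbolic regression step, which keeps that search tractable.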
In this way, both the pretrained models imported by MatterVial and the adjacent models trained on the fly benefit from enhanced transparency, enabling users to decode the underlying chemical principles driving the predictions.

Declarations

Acknowledgements
We acknowledge the supercomputing facilities of the Université catholique de Louvain (CISM/UCL) and the Consortium des Équipements de Calcul Intensif en Fédération Wallonie Bruxelles (CÉCI) for computational resources. This work also made use of Lucia, the Tier-1 supercomputer of the Walloon Region.

Funding
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brazil (CAPES) – Finance Code 001 and Université catholique de Louvain (Ref. No. ARH/MKK/01155839).

Author contributions
R.G., P.D.B., and G.M.R. conceptualized the study. R.G. performed the experiments and data analysis, and wrote the main manuscript. P.D.B. and G.M.R. supervised the investigation. R.G., P.D.B., T.P., G.M.R. and M.J.L.S. reviewed and approved the final manuscript.

Competing interests
The authors declare no competing interests.

Data Availability
A supplementary data repository with detailed results of our work is available at https://github.com/rogeriog/MatterVial_SupportData.

References
1. Rodrigues, J. F., Florea, L., De Oliveira, M. C. F., Diamond, D. & Oliveira, O. N. Big data and machine learning for materials science. Discov Mater 1, 12 (2021).
2. Dey, A. et al. State of the Art and Prospects for Halide Perovskite Nanocrystals. ACS Nano 15, 10775–10981 (2021).
3. Guo, K., Yang, Z., Yu, C.-H. & Buehler, M. J. Artificial intelligence and machine learning in design of mechanical materials. Mater. Horiz. 8, 1153–1172 (2021).
4. De Breuck, P.-P., Hautier, G. & Rignanese, G. M. Materials property prediction for limited datasets enabled by feature selection and joint learning with MODNet. npj Computational Materials 7, 1–8 (2021).
5. Zhang, B., Zhou, M., Wu, J. & Gao, F. Predicting the Materials Properties Using a 3D Graph Neural Network With Invariant Representation. IEEE Access 10, 62440–62449 (2022).
6. Tawfik, S. A. & Russo, S. P. Naturally-meaningful and efficient descriptors: machine learning of material properties based on robust one-shot ab initio descriptors. J Cheminform 14, 78 (2022).
7. Ward, L. et al. Matminer: An open source toolkit for materials data mining. Computational Materials Science 152, 60–69 (2018).
8. Pretto, T., Baum, F., Gouvêa, R. A., Brolo, A. G. & Santos, M. J. L. Optimizing the Synthesis Parameters of Double Perovskites with Machine Learning Using a Multioutput Regression Model. J. Phys. Chem. C 128, 7041–7052 (2024).
9. Liu, J. et al. Toward Excellence of Electrocatalyst Design by Emerging Descriptor‐Oriented Machine Learning. Adv Funct Materials 32, 2110748 (2022).
10. Kim, G.-H., Lee, C., Kim, K. & Ko, D.-H. Novel structural feature-descriptor platform for machine learning to accelerate the development of organic photovoltaics. Nano Energy 106, 108108 (2023).
11. Li, S. et al. Encoding the atomic structure for machine learning in materials science. WIREs Comput Mol Sci 12, e1558 (2022).
12. Dunn, A. MatBench Leaderboard. https://matbench.materialsproject.org/ (2024).
13. Riebesell, J. et al. A framework to evaluate machine learning crystal stability predictions. Nat Mach Intell 7, 836–847 (2025).
14. Lam Pham, T. et al. Machine learning reveals orbital interaction in materials. Science and Technology of Advanced Materials 18, 756–765 (2017).
15. Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87, 184115 (2013).
16. Kanter, J. M. & Veeramachaneni, K. Deep feature synthesis: Towards automating data science endeavors. in 2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA) 1–10 (IEEE, Campus des Cordeliers, Paris, France, 2015). doi:10.1109/DSAA.2015.7344858.
17. Arik, S. O. & Pfister, T. TabNet: Attentive Interpretable Tabular Learning. Preprint at https://doi.org/10.48550/arXiv.1908.07442 (2020).
18. Ruff, R., Reiser, P., Stühmer, J. & Friederich, P. Connectivity Optimized Nested Graph Networks for Crystal Structures. Preprint at http://arxiv.org/abs/2302.14102 (2023).
19. Ko, T. W. et al. Materials Graph Library (MatGL), an open-source graph deep learning library for materials science and chemistry. Preprint at https://doi.org/10.48550/arXiv.2503.03837 (2025).
20. Goodall, R. E. A. & Lee, A. A. Predicting materials properties without crystal structure: deep representation learning from stoichiometry. Nat Commun 11, (2020).
21. Rhodes, B. et al. Orb-v3: atomistic simulation at scale. Preprint at https://doi.org/10.48550/arXiv.2504.06231 (2025).
22. Shiota, T., Ishihara, K. & Mizukami, W. Universal neural network potentials as descriptors: towards scalable chemical property prediction using quantum and classical computers. Digital Discovery 3, 1714–1728 (2024).
23. El-Samman, A. M. et al. Global geometry of chemical graph neural network representations in terms of chemical moieties. Digital Discovery 3, 544–557 (2024).
24. El-Samman, A. M., De Castro, S., Morton, B. & De Baerdemacker, S. Transfer learning graph representations of molecules for pKa, 13C-NMR, and solubility. Can. J. Chem. 102, 275–288 (2024).
25. Elijošius, R. et al. Zero shot molecular generation via similarity kernels. Nat Commun 16, 5991 (2025).
26. Kim, S. Y., Park, Y. J. & Li, J. Leveraging neural network interatomic potentials for a foundation model of chemistry. Preprint at https://doi.org/10.48550/arXiv.2506.18497 (2025).
27. Oviedo, F., Ferres, J. L., Buonassisi, T. & Butler, K. T. Interpretable and Explainable Machine Learning for Materials Science and Chemistry. Acc. Mater. Res. 3, 597–607 (2022).
28. Pilania, G. Machine learning in materials science: From explainable predictions to autonomous design. Computational Materials Science 193, (2021).
29. Abolhasani, M. & Kumacheva, E. The rise of self-driving labs in chemical and materials sciences. Nat. Synth 2, 483–492 (2023).
30. Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the Matbench test set and Automatminer reference algorithm. npj Computational Materials 6, 1–10 (2020).
31. De Breuck, P.-P., Evans, M. L. & Rignanese, G.-M. Robust model benchmarking and bias-imbalance in data-driven materials science: a case study on MODNet. J. Phys.: Condens. Matter 33, 404002 (2021).
32. Citrine Informatics. Mechanical properties of some steels.
33. Choudhary, K., Kalish, I., Beams, R. & Tavazza, F. High-throughput Identification and Characterization of Two-dimensional Materials using Density functional theory. Sci Rep 7, 5179 (2017).
34. Petretto, G. et al. High-throughput density-functional perturbation theory phonons for inorganic materials. Sci Data 5, 180065 (2018).
35. Zhuo, Y., Mansouri Tehrani, A. & Brgoch, J. Predicting the Band Gaps of Inorganic Solids by Machine Learning. J. Phys. Chem. Lett. 9, 1668–1673 (2018).
36. Petousis, I. et al. High-throughput screening of inorganic compounds for the discovery of novel dielectric and optical materials. Sci Data 4, 160134 (2017).
37. Jain, A. et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL materials 1, 11002 (2013).
38. Kawazoe, Y., Masumoto, T., Tsai, A.-P., Yu, J.-Z. & Aihara Jr., T. 1 Introduction. in Nonequilibrium Phase Diagrams of Ternary Amorphous Alloys (eds. Kawazoe, Y., Yu, J.-Z., Tsai, A.-P. & Masumoto, T.) vol. 37A 1–32 (Springer-Verlag, Berlin/Heidelberg, 1997).
39. Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput Mater 2, 16028 (2016).
40. De Jong, M. et al. Charting the complete elastic properties of inorganic crystalline compounds. Sci Data 2, 150009 (2015).
41. Castelli, I. E. et al. New cubic perovskites for one- and two-photon water splitting using the computational materials repository. Energy Environ. Sci. 5, 9034 (2012).
42. Shen, J. et al. Reflections on one million compounds in the open quantum materials database (OQMD). J. Phys. Mater. 5, 031001 (2022).
43. Schmidt, J. et al. Machine‐Learning‐Assisted Determination of the Global Zero‐Temperature Phase Diagram of Materials. Advanced Materials 35, 2210788 (2023).
44. Barroso-Luque, L. et al. Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models. Preprint at https://doi.org/10.48550/arXiv.2410.12771 (2024).
45. matbench_v0.1: MegNet (kgcnn v2.1.0) - MatBench. Materialsproject.org https://matbench.materialsproject.org/Full%20Benchmark%20Data/matbench_v0.1_MegNet_kgcnn_v2.1.0/ (2020).
46. Xu, Y. & Qian, Q. i-SISSO: Mutual information-based improved sure independent screening and sparsifying operator algorithm. Engineering Applications of Artificial Intelligence 116, 105442 (2022).
47. Jiang, X., Liu, G., Xie, J. & Hu, Z. Boosting SISSO Performance on Small Sample Datasets by Using Random Forests Prescreening for Complex Feature Selection. Preprint at https://doi.org/10.48550/ARXIV.2409.19209 (2024).
48. Foppa, L., Purcell, T. A. R., Levchenko, S. V., Scheffler, M. & Ghiringhelli, L. M. Hierarchical Symbolic Regression for Identifying Key Physical Parameters Correlated with Bulk Properties of Perovskites. Phys. Rev. Lett. 129, 055301 (2022).
49. Chen, T. & Guestrin, C. XGBoost: A Scalable Tree Boosting System. in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 785–794 (ACM, San Francisco California USA, 2016). doi:10.1145/2939672.2939785.
50. Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph Networks as a Universal Machine Learning Framework for Molecules and Crystals. Chem. Mater. 31, 3564–3572 (2019).
51. Purcell, T. A. R., Scheffler, M. & Ghiringhelli, L. M. Recent advances in the SISSO method and their implementation in the SISSO++ code. The Journal of Chemical Physics 159, 114110 (2023).
Supplementary Files
SIMatterVialNPJ2025REV.docx

Version 1 posted
Editorial decision: Revision requested 10 Oct, 2025
Reviews received at journal 01 Oct, 2025
Reviews received at journal 25 Sep, 2025
Reviewers agreed at journal 15 Sep, 2025
Reviewers agreed at journal 10 Sep, 2025
Reviewers agreed at journal 08 Sep, 2025
Reviewers invited by journal 08 Sep, 2025
Editor assigned by journal 08 Sep, 2025
Submission checks completed at journal 06 Sep, 2025
First submitted to journal 02 Sep, 2025
Authors and affiliations: Rogério Almeida Gouvêa (Federal University of Rio Grande do Sul, corresponding author); Pierre-Paul de Breuck (Université Catholique de Louvain); Tatiane Pretto (Federal University of Rio Grande do Sul); Gian-Marco Rignanese (Université Catholique de Louvain); Marcos José Leite Santos (Federal University of Rio Grande do Sul).
Published version: https://doi.org/10.1038/s41524-025-01938-2
06:52:38","extension":"png","order_by":27,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":349694,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/3a277af065611c0a8f4579e3.png"},{"id":91489339,"identity":"48da69b7-bdcb-4558-ae8e-d12332525cf0","added_by":"auto","created_at":"2025-09-17 05:13:27","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":324413,"visible":true,"origin":"","legend":"\u003cp\u003eOverview of the methodology for leveraging latent-space features from GNN models with MatterVial. On the left (I), the generation and deployment of descriptor-oriented GNN models are illustrated. At the center (II), task-oriented GNN models—either pretrained or trained on the fly for adjacent variants—are shown, with feature extraction possible from activation, pooling, or multi-layer perceptron (MLP) layers on the model architecture. 
On the right (III), formulas obtained by symbolic regression with SISSO are also implemented, leveraging traditional physicochemical descriptors available in MatMiner as a base.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/60eec9307f123b5f1729280a.png"},{"id":91816620,"identity":"9343a507-7fd8-4072-8c29-f24f667c06ee","added_by":"auto","created_at":"2025-09-22 06:52:14","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":477678,"visible":true,"origin":"","legend":"\u003cp\u003eThe impact of different MatterVial feature sets on the performance of MatBench’s perovskite heat of formation task is illustrated with three different types of plots for each of the three models: on the left, MODNet@MV(no ORB), where features from the ORB model are excluded; at the center, MODNet@MV, which includes all pretrained MatterVial features; and on the right, MODNet@MV+Adj(coGN), which further incorporates features from an adjacent coGN model trained on the task. (a–c) Bar plots showing the feature importance aggregated by feature group through the sum of the mean absolute SHAP values. (d–f) t-SNE projections of the SHAP values for each feature in the model, colored by feature group, with some of the highest-contribution features annotated. 
(g–i) t-SNE projections of the top 10 most important features colored by the target value (heat of formation).\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/6f182aa0b0b8f248d6b038e7.png"},{"id":91817018,"identity":"bd5e9ba1-1c8f-4d85-88a9-15d89302a53d","added_by":"auto","created_at":"2025-09-22 06:53:17","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":323498,"visible":true,"origin":"","legend":"\u003cp\u003eSHAP values plot for selected MatterVial features in the MODNet@MV(no ORB) model for the perovskite heat of formation. The plot displays the impact of individual features on the model's output (SHAP value), with the color indicating the feature's value (blue for low, red for high). Alongside each feature, the corresponding 1-term SISSO formula approximation for the MatterVial features and its R² value, when appropriate, are shown.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/c067dfe6e7812a845cf10b1e.png"},{"id":91489338,"identity":"246ebe04-63c3-418e-98b7-eba41a27f977","added_by":"auto","created_at":"2025-09-17 05:13:27","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":276971,"visible":true,"origin":"","legend":"\u003cp\u003eSISSO models and hiSISSO-enhanced MODNet model analysis on matbench_perovskites task. (a) Mean absolute error of models with baseline MatMiner+OFM features (orange) vs. those augmented with SISSO formulas approximating the best GNN features (dark green). Feature selection methods include i-SISSO, rf-SISSO, and xgb-rfe-SISSO; (b) t-SNE projection of SHAP values for top feature groups in the final MODNet@MV+adj(coGN)+hiSISSO model; point size reflects feature impact. 
(c) Average feature importance across main classes in the final model, calculated from the mean absolute SHAP values.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/6c2772597fbbd1c2eec4f4f7.png"},{"id":100614525,"identity":"d260de32-aa4f-47a4-8d81-e1afc9b501c4","added_by":"auto","created_at":"2026-01-19 17:21:36","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":2418986,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/e4c89580-084d-4ab9-a7a7-3d95baaae5ec.pdf"},{"id":91489348,"identity":"4b781ffc-49b4-45e3-8ed7-4c73f255ed75","added_by":"auto","created_at":"2025-09-17 05:13:28","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":2980604,"visible":true,"origin":"","legend":"","description":"","filename":"SIMatterVialNPJ2025REV.docx","url":"https://assets-eu.researchsquare.com/files/rs-7518209/v1/2b22f75fa5aa536c6fc25b0c.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Combining feature-based approaches with graph neural networks and symbolic regression for synergistic performance and interpretability","fulltext":[{"header":"Introduction","content":"\u003cp\u003eMachine learning has revolutionized materials science, accelerating material discovery and property optimization across various domains\u003csup\u003e\u003cspan additionalcitationids=\"CR2\" citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e\u003c/sup\u003e. 
The two prominent approaches in this field are feature-based and graph-neural-network (GNN) models, each with distinct advantages and limitations\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e. Feature-based models rely on predefined descriptors such as elemental properties, geometric features, and electronic structure information. They are highly interpretable and effective with small datasets, offering insights into structure-property relationships\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e. These models adapt well to custom tasks in experimental settings, such as nanocrystal research\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e, catalysis\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, and organic photovoltaics\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e. In contrast, GNN models represent materials as graphs, capturing structural information through message passing and learning deep representations with simple atomic descriptors. This often results in more accurate predictions for complex materials, but requires greater computational resources and data for training\u003csup\u003e\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. 
GNNs are particularly effective in the large-scale screening of materials and for constructing interatomic potentials owing to their efficient computation and local information aggregation\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e; however, they lack interpretability.\u003c/p\u003e\u003cp\u003eBoosting the accuracy of feature-based models to make them competitive on larger datasets usually requires employing neural network models and relying on extensive featurization suites, such as MatMiner\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e, to produce meaningful features. This process is particularly time-consuming for sophisticated descriptors like the Orbital Field Matrix (OFM)\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e and the Smooth Overlap of Atomic Positions (SOAP)\u003csup\u003e\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. A novel strategy to boost these feature-based models involves leveraging the rich latent-space representations learned by GNN models pretrained on vast datasets. Even though neural networks are universal function approximators, easing their burden through well-aligned feature transformations can improve generalization, reduce training time, and stabilize convergence\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003eIn this work, we address these challenges by proposing a hybrid approach that combines traditional chemically intuitive descriptors with latent features obtained from a diverse set of pretrained models. 
We incorporate features from both structure-based (MEGNet, coGN)\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e and composition-based (ROOST)\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e GNNs, as well as from ORB\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e, a powerful equivariant Machine Learning Interatomic Potential (MLIP). To avoid the featurization bottleneck of traditional descriptors, we also leverage GNNs to generate fast, latent-space approximations of MatMiner (ℓ-MM) and Orbital Field Matrix (ℓ-OFM) features. Finally, we augment this feature set with new descriptors derived via symbolic regression. This multifaceted strategy aims to create a more robust, accurate, and versatile featurizer that capitalizes on the distinct strengths of each approach and remains useful across a wider range of dataset sizes.\u003c/p\u003e\u003cp\u003eTo simplify the generation of all these features, we developed a package named MatterVial, which stands for \u003cb\u003eMAT\u003c/b\u003eerials fea\u003cb\u003eT\u003c/b\u003eu\u003cb\u003eR\u003c/b\u003ee \u003cb\u003eE\u003c/b\u003extraction \u003cb\u003eV\u003c/b\u003eia \u003cb\u003eI\u003c/b\u003enterpretable \u003cb\u003eA\u003c/b\u003ertificial \u003cb\u003eL\u003c/b\u003eearning. Besides producing all latent-space features from the GNN models, it aids in obtaining the interpretable chemical descriptors that correlate with these high-level features. This is achieved through techniques such as SHapley Additive exPlanations (SHAP) analysis in surrogate models and symbolic regression via Sure Independence Screening and Sparsifying Operator (SISSO) to obtain an approximate formula from the most important features. 
Our results demonstrate an overall improvement in all analyzed datasets compared with the baseline MatMiner featurizer. In addition, it surpassed the performance of the individual GNN models in several cases, indicating that the combination of traditional and latent-space features leads to a more robust generalization.\u003c/p\u003e\u003cp\u003eThis work is situated within a recent methodological trend that repurposes GNNs not as end-to-end predictors, but as powerful and data-efficient feature generators for a variety of downstream tasks\u003csup\u003e\u003cspan additionalcitationids=\"CR23 CR24 CR25\" citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. Our approach bridges feature-based and graph-based methods, leveraging their strengths to develop more versatile and task-agnostic machine learning models in materials science. By enhancing the accuracy, efficiency, and interpretability of property prediction, this framework facilitates the integration of both experimental and simulated data. 
Moreover, it aligns with the growing demand for explainable AI\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e,\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e, which is essential for the advancement of self-driving laboratories in materials discovery and optimization\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"Results and discussion","content":"\u003cp\u003eWe evaluate our approach using the full MatBench v0.1 benchmark\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e with MODNet\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e, which is the state-of-the-art feature-based model in materials science\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e. We adopt the same MatMiner featurization as that used in MODNet for MatBench in the original publication\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. These can be complemented by three categories of MatterVial features, as illustrated in Fig. \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e:\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eI. Latent-space features from descriptor-oriented GNNs\u003c/strong\u003e: Conventional material descriptors are transformed into latent representations using an autoencoder trained on Materials Project (MP) data. These descriptors include the widely used MatMiner features (ℓ-MM) and the features from the Orbital Field Matrix featurizer (ℓ-OFM). A GNN was then trained to replicate these latent features directly from the input structures. 
This method achieves a computational efficiency similar to that of GNNs and still preserves interpretability via decoding.\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eII. Latent-space features from task-oriented GNNs\u003c/strong\u003e: These features are extracted directly from the intermediate layers of pretrained GNN models that have been developed for various tasks. Specifically, we incorporate MEGNet models from the Materials Virtual Lab (MVL) that were pretrained for the prediction of elastic constants, band gap, and formation energy, as well as for metal-insulator classification. We also consider composition-based ROOST models for the band gap and formation energy. In addition, we include features from the internal layers of ORB-v3, a state-of-the-art equivariant MLIP trained to reproduce energies and forces. This group capitalizes on the strengths of GNN architectures in capturing complex structural representations, aiming to enhance predictive performance on larger datasets.\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e\u003cstrong\u003eIII. Symbolically derived feature combinations\u003c/strong\u003e: Here, we use the MatMiner features as a basis to generate new compound features. Through symbolic regression with SISSO, we identify several combinations of pairs of features (rung one) that exhibit enhanced correlations with the target properties of interest in materials science. These derived formulas are then incorporated as new features.\u003c/p\u003e\n\u003c/span\u003e\n\u003cp\u003e\u003c/p\u003e\n\u003cp\u003eSince the features obtained from task-oriented GNNs are high-level and not directly interpretable as traditional descriptors, we develop a method to decompose them into interpretable descriptors, which is integrated into the Interpreter module of MatterVial. In addition, features from descriptor-oriented GNNs can be decoded into their interpretable counterparts. 
Likewise, the formulas of the symbolically derived features in the third group can be retrieved by name. Comprehensive implementation details for each category and for the Interpreter module are available in the Methods section and in the \u003cem\u003eSupplementary Information\u003c/em\u003e.\u003c/p\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003eMatBench validation of MatterVial features\u003c/h2\u003e\n \u003cp\u003eTable \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e presents the performance of MODNet using MatMiner augmented with MatterVial descriptors (MODNet@MM\u0026thinsp;+\u0026thinsp;MV) and MODNet using only MatterVial descriptors (MODNet@MV) relative to the baseline model using only MatMiner features (MODNet@MM) in the 13 MatBench tasks. The results show that blending latent-space representations from both task-oriented and descriptor-oriented GNNs with symbolically derived features consistently reduces prediction errors across this diverse array of property prediction tasks.\u003c/p\u003e\n \u003cp\u003eOur approach significantly improves the performance on smaller datasets, where feature-based models have traditionally outperformed GNNs. Specifically, our models set new performance records for four tasks previously led by MODNet@MM and now achieve a leading performance in metallicity classification from experimental data. Among these, only the glass-forming ability task showed no substantial improvement. We highlight that for smaller composition-based datasets, MatMiner featurization is sufficiently fast to make MODNet@MM\u0026thinsp;+\u0026thinsp;MV computationally effective. For larger datasets, in which traditional featurization is very time-consuming, our MODNet@MV models substantially narrow the gap between feature-based and graph-based models, even outperforming state-of-the-art (SOTA) models in predicting properties such as elastic constants, band gap, metallicity, and formation energy. 
This success demonstrates that our approach effectively addresses the common shortcomings of both feature- and graph-based models. Note, however, that some of the larger MatBench tasks can no longer be considered truly independent test sets for models exposed to vast amounts of similar ab initio data during pretraining.\u003c/p\u003e\n \u003cdiv class=\"gridtable\"\u003e\n \u003ctable id=\"Tab1\" border=\"1\"\u003e\n \u003ccaption language=\"En\"\u003e\n \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n \u003cdiv class=\"CaptionContent\"\u003e\n \u003cp\u003ePerformance comparison of three MODNet variants against the best multi-purpose MatBench model on each task in the MatBench v0.1 benchmark. Metrics are reported as mean absolute error (MAE) for regression and area under the receiver operating characteristic curve (AUROC) for classification. MODNet@MM uses only MatMiner features; MODNet@MM\u0026thinsp;+\u0026thinsp;MV augments these features with MatterVial descriptors; and MODNet@MV uses only MatterVial features, essentially replacing MatMiner features with ℓ-MM. For each task, the MatterVial feature groups that yield the best result are shown. 
Scores in bold identify the overall best model per task, and shaded tasks are those in which MODNet was already the best model.\u003c/p\u003e\n \u003c/div\u003e\n \u003c/caption\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMatBench task\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:n\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMODNet@\u003c/p\u003e\n \u003cp\u003eMM\u003c/p\u003e\n \u003cp\u003e(baseline)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMODNet@\u003c/p\u003e\n \u003cp\u003eMM\u0026thinsp;+\u0026thinsp;MV\u003c/p\u003e\n \u003cp\u003e(% error reduction*)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMODNet@\u003c/p\u003e\n \u003cp\u003eMV\u003c/p\u003e\n \u003cp\u003e(% error reduction*)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eMatBench record\u003c/p\u003e\n \u003cp\u003e(model)\u003c/p\u003e\n \u003c/th\u003e\n \u003cth align=\"left\"\u003e\n \u003cp\u003eBest MatterVial groups**\u003c/p\u003e\n \u003c/th\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eSteels yield strength (MPa) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:312\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:87.76\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n 
\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:85.12\\:\\left(3.0\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e120.95\u003c/p\u003e\n \u003cp\u003e(-37.8%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMODNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROOST\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eE\u003csub\u003eexfol\u003c/sub\u003e (meV/atom)\u003csup\u003e33\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:636\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:33.19\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:29.19\\:\\left(12.1\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e28.86 (13.0%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMODNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eORB, MVL, ℓ-OFM, ℓ-MM\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eargmax(PhDOS) (cm\u003csup\u003e-1\u003c/sup\u003e) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{1,265}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:34.27\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:30.08\\)\u003c/span\u003e\u003c/span\u003e \u003cstrong\u003e(12.2%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e30.58 (10.8%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:28.76\\)\u003c/span\u003e\u003c/span\u003e (MEGNet)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMVL, ORB, ℓ-OFM, ROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eExp. 
band gap (eV) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{4,604}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.333\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.290\\:\\left(12.9\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.351 (-5.5%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMODNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eRefractive index \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e36\u003c/span\u003e,\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{4,764}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.271\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.235\\:\\)\u003c/span\u003e\u003c/span\u003e(13.3%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.234 (13.7%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMODNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eORB, ℓ-OFM, MVL, ℓ-MM\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eExp. metallicity (eV) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{4,921}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.916\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.976\\:\\left(71.4\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.898 (-59.3%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.921\\)\u003c/span\u003e\u003c/span\u003e (AMMExpress)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROOST\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eGlass-forming ability\u003c/p\u003e\n \u003cp\u003e\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e38\u003c/span\u003e,\u003cspan class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan 
class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{5,680}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.936\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\left(0.960\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.937\\)\u003c/span\u003e\u003c/span\u003e (1.6%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e0.904 (-50.0%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMODNet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eROOST\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogarithmic G\u003csup\u003evrh\u003c/sup\u003e\u003c/p\u003e\n \u003cp\u003e(log\u003csub\u003e10\u003c/sub\u003eGPa) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{10,987}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.073\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:0.032\\:\\left(55.5\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.033\\:\\left(54.8\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.067\\)\u003c/span\u003e\u003c/span\u003e (coGN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMVL, ORB, ℓ-MM, ROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eLogarithmic K\u003csup\u003evrh\u003c/sup\u003e\u003c/p\u003e\n \u003cp\u003e(log\u003csub\u003e10\u003c/sub\u003eGPa) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{10,987}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.056\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.027\\:\\left(49.6\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.028\\:\\left(50.1\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan 
class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.049\\)\u003c/span\u003e\u003c/span\u003e (coNGN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMVL, ORB, ℓ-MM, ℓ-OFM, ROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003ePerovskite \u0026Delta;H\u003csub\u003eform\u003c/sub\u003e\u003c/p\u003e\n \u003cp\u003e(eV/unitcell)\u003csup\u003e41\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{18,928}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0908\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0386\\:\\left(57.5\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0389\\:\\left(57.3\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0269\\)\u003c/span\u003e\u003c/span\u003e (coGN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eORB, MVL, ℓ-OFM, ℓ-MM, ROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eBand gap (eV) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{106,113}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.2199\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.137\\:\\left(37.6\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.137 (37.8%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.156\\)\u003c/span\u003e\u003c/span\u003e (coGN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMVL, ORB, ROOST, SISSO\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMetallicity \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{106,113}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.904\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.978\\:\\left(77.1\\mathbf{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
align=\"left\"\u003e\n \u003cp\u003e0.976 \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\left(75.0\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.9520\\)\u003c/span\u003e\u003c/span\u003e (CGCNN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eORB, MVL, ℓ-OFM, ROOST\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eE\u003csub\u003ef\u003c/sub\u003e (eV/atom) \u003csup\u003e\u003cspan class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{132,752}\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0448\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0147\\:\\left(67.2\\text{\\%}\\right)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.0138\u003c/strong\u003e (69.2%)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0170\\)\u003c/span\u003e\u003c/span\u003e (coGN)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd align=\"left\"\u003e\n \u003cp\u003eMVL, ORB, ℓ-OFM, ℓ-MM\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd align=\"left\" colspan=\"7\"\u003e\n \u003cp\u003e* 
\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\%\\:\\text{error reduction}=\\frac{\\text{MAE}_{\\text{baseline}}-\\text{MAE}_{\\text{model}}}{\\text{MAE}_{\\text{baseline}}}\\times\\:100\\%\\)\u003c/span\u003e\u003c/span\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\text{(regression)}\\)\u003c/span\u003e\u003c/span\u003e \u003cem\u003eor\u003c/em\u003e \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\frac{\\left(1-\\text{AUROC}_{\\text{baseline}}\\right)-\\left(1-\\text{AUROC}_{\\text{model}}\\right)}{\\left(1-\\text{AUROC}_{\\text{baseline}}\\right)}\\times\\:100\\%\\)\u003c/span\u003e\u003c/span\u003e (classification)\u003c/p\u003e\n \u003cp\u003e** Ordered by importance. MVL, ORB, and ROOST refer to the task-oriented GNN features: those from the MVL MEGNet models for structures, the Orb-v3 MLIP, and the pretrained ROOST models for compositions, respectively. ℓ-MM and ℓ-OFM refer to the descriptor-oriented GNN features; when included, ℓ-MM substitutes for the MatMiner features for faster generation. SISSO refers to the group of features derived from MM features via symbolic regression.\u003c/p\u003e\n \u003cp\u003e\u003csup\u003e\u0026dagger;\u003c/sup\u003e As we were unable to replicate the reported 0.960 AUROC for glass formability using MODNet, we present our best MODNet@MM result as the baseline instead. Despite the lower score, MODNet continues to outperform other models in MatBench for this task.\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n \u003c/div\u003e\n \u003cp\u003eAn analysis of the feature contributions in Table \u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e reveals that task-oriented latent features are the primary drivers of performance gains. ROOST was included to enhance performance in composition-based tasks, yet it also reliably improved results across a wide range of tasks that contained structural information. 
This performance may be attributed to the attention mechanism that captures unique patterns during activation and material pooling. For structure-based tasks, MVL-derived features have shown a significant positive impact. They boost predictions even when the prediction targets differ from those used in the original models, such as in predicting the perovskite heat of formation and refractive index. The ORB features, derived from an equivariant MLIP, proved particularly impactful, frequently appearing as top contributors. This is chemically intuitive, as the model\u0026apos;s training on energies and forces provides a rich, physically meaningful latent space that is useful for transfer learning. This aligns with recent findings by Kim et al.\u003csup\u003e\u003cspan class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e, who also employed ORB features with MODNet for structure-based regression tasks. Our approach achieves enhanced performance by incorporating all Orb-v3 layers and combining these features with diverse descriptor groups within our framework.\u003c/p\u003e\n \u003cp\u003eThe descriptor-oriented and symbolically derived features also provided consistent complementary improvements. The ℓ-OFM features improved performance across most tasks, validating that our GNN-based approximation is an efficient and effective method for incorporating the descriptive power of computationally expensive descriptors like the Orbital Field Matrix. The ℓ-MM features, designed as a GNN-based shortcut for MatMiner features, led to improved or similar performance on many tasks. From this comparison with the models that used the full MatMiner features (MODNet@MM\u0026thinsp;+\u0026thinsp;MV), we argue that the reconstruction loss was sufficiently low and that, in some cases, the encoder effectively refined the representation via regularization, improving the metrics. 
Crucially, these latent-space representations remain decodable, preserving much of the interpretability, which is a hallmark of feature-based models. Finally, the SISSO-derived features, while less universally impactful, still boosted performance in roughly half of the benchmarks. Given that we utilized only first-rung symbolic regression, we conjecture that there is clear potential for further gains with higher-level, more complex formulas. Ultimately, these results show that our approach simultaneously accelerates featurization, improves model performance, and provides valuable chemical insights. This combination of benefits repositions feature-based models as strong and practical alternatives to end-to-end GNNs for property prediction.\u003c/p\u003e\n\u003c/div\u003e\n\u003ch3\u003eSynergy of MatterVial features and adjacent GNN model\u003c/h3\u003e\n\u003cp\u003eHaving demonstrated the performance gains of our method, we now turn to the individual contributions of the MatterVial features. We examine the synergetic effects of each MatterVial feature group using the perovskite heat of formation task as an example. Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates a step-by-step performance evaluation for this task, revealing how the integration of different MatterVial feature groups leads to cumulative improvements. Starting from our baseline, the MODNet@MM model delivers an MAE of 0.0888 eV/unit cell. This performance serves as a reference point against which the benefits of the additional features can be measured.\u003c/p\u003e\u003cp\u003eThe first modification involves introducing descriptor-oriented GNN features, ℓ-OFM and ℓ-MM, which are designed to be computationally faster approximations of their full counterparts. When MatMiner features are entirely replaced by their latent representation (MODNet@ℓ-MM), the MAE is 0.1052 eV/unit cell. 
While higher than our MODNet@MM baseline (0.0888 eV/unit cell), this still significantly outperforms AutoMatMiner (0.2005 eV/unit cell), demonstrating ℓ-MM as a viable, faster featurization alternative. Augmenting MatMiner with ℓ-OFM (MODNet@MM+ℓ-OFM) reduces the MAE to 0.0794 eV/unit cell. This is lower than the baseline, although still higher than that obtained using the original computationally intensive OFM features (MODNet@MM\u0026thinsp;+\u0026thinsp;OFM, 0.0751 eV/unit cell). Combining both ℓ-MM and ℓ-OFM (MODNet@ℓ-MM+ℓ-OFM) yields an MAE of 0.0973 eV/unit cell. These results highlight that our proxy GNN featurizers offer a compelling trade-off, capturing essential chemical information with a substantial speed-up in featurization.\u003c/p\u003e\u003cp\u003eBuilding on this foundation, the incorporation of task-oriented GNN features from the MVL pretrained models further boosts performance in the MODNet@ℓ-MM+ℓ-OFM\u0026thinsp;+\u0026thinsp;MVL model, lowering the MAE to 0.0673 eV/unit cell. Clearly, the MVL descriptors capture additional structural and physicochemical details that the MM and OFM features do not, thereby enhancing the ability of the model to predict the heat of formation (more details on the MVL descriptors and the effect of different layers are given in the \u003cem\u003eSupplementary Information\u003c/em\u003e, section S6).\u003c/p\u003e\u003cp\u003eNext, the addition of symbolically derived feature combinations via SISSO produces a modest refinement, reducing the MAE to 0.0653 eV/unit cell. Although the improvement is small, it underscores the notion that simple algebraic combinations of conventional descriptors can reveal non-linear relationships, complement the latent-space features, and thereby enhance prediction accuracy.\u003c/p\u003e\u003cp\u003eFurther refinement is achieved by incorporating composition-based ROOST features. 
At first glance, one might not expect an improvement over the MEGNet MVL models since they incorporate structural information alongside composition. However, we believe that the attention-based mechanism present in ROOST is responsible for capturing additional meaningful information that complements the other feature groups, yielding an MAE of 0.0639 eV/unit cell. Furthermore, at this point, using the standard MatMiner features instead of their latent representation (ℓ-MM) yields nearly equivalent performance (MAE of 0.0637 eV/unit cell). These results confirm that the rapidly generated encoded representations can effectively replace the full MatMiner features in tandem with other descriptors. However, eliminating MatMiner features entirely (neither MM nor ℓ-MM) causes a significant decrease in accuracy, with MAEs of 0.0707 eV/unit cell for MODNet@MVL\u0026thinsp;+\u0026thinsp;ROOST and 0.0716 eV/unit cell for MODNet@MVL, indicating that the MatMiner features are valuable and not simply redundant with these GNN descriptors. In fact, a synergistic effect among all MatterVial feature groups is observed in this dataset.\u003c/p\u003e\u003cp\u003eORB features stand apart from other featurizers like MVL and ROOST. While MVL and ROOST were trained on smaller datasets, specifically MP and OQMD\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e (about 1.5\u0026nbsp;million structures combined), the Orb-v3 featurizer was trained on a significantly larger dataset. This dataset, which combines MP, Alexandria\u003csup\u003e\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, and OMat\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e, leverages approximately 120\u0026nbsp;million calculated structures, a number at least two orders of magnitude larger than either of those datasets. 
Extracting features from this model in MatterVial for use in MODNet significantly reduces the mean absolute error in the task, and a slight further improvement is still seen when the other MatterVial features are included. We conjecture that larger reductions might still be achievable by training more task- and descriptor-oriented models on these larger datasets.\u003c/p\u003e\u003cp\u003eDespite the significant reduction, feature-based approaches using pretrained models with MatterVial or HackNIP still fall short of the results obtained purely with GNNs such as MEGNet and coGN trained on the perovskites dataset. Based on this observation, we incorporate into MatterVial the possibility of training adjacent GNN models on the fly and extracting their features with the AdjacentGNNFeaturizer class. We achieve 0.0343 eV/unit cell using the MEGNet adjacent model features. The MEGNet benchmarked MAE (0.0352 eV/unit cell) is substantially lower than what we achieved using the default configuration of the model (0.0685 eV/unit cell), even with the same elemental embeddings provided by the authors. This discrepancy is possibly due to differences in hyperparameters, inclusion of additional features, and longer training times employed for the benchmark\u003csup\u003e\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. Finally, we employed the SOTA coGN model as an adjacent model for feature extraction, and we obtained results comparable to the values reported in MatBench with this model. 
Incorporating coGN features in our MODNet model reduced the MAE to 0.0313 eV/unit cell, which is much closer to the 0.0269 eV/unit cell record.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eMean absolute errors (MAEs) for the MatBench task of the heat of formation of perovskites with different models.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReference models\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eMAE\u003c/p\u003e\u003cp\u003e(eV/unit cell)\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMatterVial models\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eMAE\u003c/p\u003e\u003cp\u003e(eV/unit cell)\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cem\u003eDescriptor-oriented\u003c/em\u003e\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAutoMatMiner 
(MatBench*)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.2005\\:(\\pm\\:0.0085)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@ℓ-MM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.1052\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0022)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MM (this work)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0888\\:(\\pm\\:0.0025)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@MM+ℓ-OFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0794\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0016)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MM\u0026thinsp;+\u0026thinsp;OFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.0751 \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0018)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@ℓ-MM+ℓ-OFM\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0973\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0016)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTask-oriented (MVL, ROOST)\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MVL\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.0716\u003c/p\u003e\u003cp\u003e(\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.0020)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet\u003c/p\u003e\u003cp\u003e@ℓ-MM+ℓ-OFM\u0026thinsp;+\u0026thinsp;MVL\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0673 (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.0015)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MVL\u0026thinsp;+\u0026thinsp;ROOST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.0707\u003c/p\u003e\u003cp\u003e(\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.0017)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003eMODNet@ℓ-MM+ℓ-OFM\u003c/p\u003e\u003cp\u003e+MVL\u0026thinsp;+\u0026thinsp;SISSO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0653\u003c/p\u003e\u003cp\u003e(\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.0013)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MM+ℓ-OFM\u0026thinsp;+\u0026thinsp;MVL\u0026thinsp;+\u0026thinsp;SISSO\u0026thinsp;+\u0026thinsp;ROOST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.0637 (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.001)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@ℓ-MM+ℓ-OFM\u0026thinsp;+\u0026thinsp;MVL\u0026thinsp;+\u0026thinsp;SISSO\u0026thinsp;+\u0026thinsp;ROOST\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0639 (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\pm\\:\\)\u003c/span\u003e\u003c/span\u003e0.0010)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eTask-oriented (ORB featurizer)\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMODNet@MM+ℓ-OFM\u003c/p\u003e\u003cp\u003e+MVL\u0026thinsp;+\u0026thinsp;SISSO\u0026thinsp;+\u0026thinsp;ROOST\u0026thinsp;+\u0026thinsp;ORB\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0386\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0009\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eHackNIP\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e (MODNet@ORB)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.0397\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@MV \u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0388\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0006\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003e\u003cb\u003eMV\u0026thinsp;+\u0026thinsp;Adjacent GNN model\u003c/b\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMEGNet (MatBench*)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan 
class=\"mathinline\"\u003e\\(\\:0.0352\\:(\\pm\\:0.0016)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eMEGNet (this work)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0685\\:(\\pm\\:0.0036)\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@\u003c/p\u003e\u003cp\u003eMV\u0026thinsp;+\u0026thinsp;Adj(MEGNet)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0343\\)\u003c/span\u003e\u003c/span\u003e\u003c/p\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0014\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003ecoGN (MatBench*)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0269\\:(\\pm\\:0.0008\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@\u003c/p\u003e\u003cp\u003eMV\u0026thinsp;+\u0026thinsp;Adj(coGN)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0313\\:(\\pm\\:0.0012\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003ecoGN (this work)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:0.0271\\:(\\pm\\:0.0008\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003eMODNet@\u003c/p\u003e\u003cp\u003eMV\u0026thinsp;+\u0026thinsp;Adj(coGN)\u0026thinsp;+\u0026thinsp;hiSISSO\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.0288\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:(\\pm\\:0.0009\\)\u003c/span\u003e\u003c/span\u003e)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colspan=\"4\" nameend=\"c4\" namest=\"c1\"\u003e\u003cp\u003e*Data retrieved from \u003cem\u003eMatBench\u003c/em\u003e\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e in August 2025.\u003c/p\u003e\u003cp\u003e\u003csup\u003e\u0026dagger;\u003c/sup\u003e For brevity MV = (ℓ-MM+ℓ-OFM\u0026thinsp;+\u0026thinsp;MVL\u0026thinsp;+\u0026thinsp;SISSO\u0026thinsp;+\u0026thinsp;ROOST\u0026thinsp;+\u0026thinsp;ORB), i.e. all pretrained featurizers in MatterVial.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e graphically depicts the synergistic effects detailed in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e by comparing different models. 
The feature importance from the mean absolute SHAP values aggregated by feature group in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e(a-c) quantifies the contribution of the different groups of features and shows a clear shift in dominance as more powerful features are introduced. However, a closer inspection reveals important nuances regarding how these feature sets interact. In the MODNet@MV(no ORB) model, there is a relatively balanced and significant contribution from all feature groups, led by MVL, ℓ-MM, and SISSO, underscoring their collective utility. This is seen more clearly in the t-SNE projections of the SHAP value vectors for each feature in the model, where these three sets of features cover most regions of the projection, with smaller contributions from the ROOST and ℓ-OFM features. When ORB features are introduced (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eb and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ee), they become the dominant contributor, explaining the dramatic reduction in MAE observed in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. Crucially, the ℓ-MM and SISSO features retain a significant portion of their importance, with SISSO even remaining among the highest contributors. This indicates that they capture complementary chemical information not fully encapsulated within the ORB latent space, which may explain the slightly better result obtained compared to HackNIP\u0026rsquo;s MODNet@ORB model\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e. This hierarchical and synergistic contribution of features is consistent with the visual improvement in the data manifold shown in the t-SNE projections (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eg-i). 
The feature space of the MODNet@MV(no ORB) model (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eg) shows some organization. However, the introduction of ORB features (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003eh) creates a significantly more structured manifold with a smoother gradient along the target property.\u003c/p\u003e\u003cp\u003eThis synergistic contribution continues in the final MODNet@MV\u0026thinsp;+\u0026thinsp;Adj(coGN) model. The inclusion of adjacent coGN features (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ei) results in the most well-defined feature space in the t-SNE projection, with the clearest separation between data points according to the target feature. While the task-specific coGN features predictably take the lead, the pretrained ORB and MVL features remain highly influential, serving as the second- and third-most important groups, respectively (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ec and \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003ef). In contrast, the contributions from ℓ-MM and SISSO are now marginal, as their predictive information has been superseded by more powerful GNN features. This layered view of contributions highlights the interpretability brought by feature-based models. In the following section, we showcase how this interpretability can be deepened using new MatterVial tools.\u003c/p\u003e\n\u003ch3\u003eInterpretability of MatterVial features\u003c/h3\u003e\n\u003cp\u003eWe begin by analyzing the most important features of the MODNet@MV(no ORB) model to understand what factors increase its accuracy in predicting the perovskite heat of formation (ΔH\u003csub\u003ef\u003c/sub\u003e). 
Unlike end-to-end GNNs, where features are deeply entangled through message passing, feature-based models have readily decoupled features, and SHAP values can be used to robustly assess the most important ones, as shown in the plot in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e. Utilizing the MatterVial Interpreter module, we can easily obtain SISSO formulas with up to five terms to approximate the GNN features of the included pretrained models. These approximations are based on interpretable descriptors from MatMiner and OFM. The plot displays the one-term formulas and their corresponding R\u0026sup2; values, demonstrating that even with relatively simple descriptors, these approximation formulas can achieve high R\u0026sup2; values for many meaningful GNN features.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eThis analysis identifies several key feature groups that drive the predictions. Features from the MVL formation energy model, for instance, correlate stability with large electronegativity gaps (promoting ionic character) and low d-electron fractions, which favor early transition metals. Features generated by SISSO highlight structural drivers, rewarding dense atomic packing, ordered coordination environments, and specific stabilizing factors like 3\u0026Aring; interatomic contacts, while penalizing destabilizing electronic effects from excess d-electrons. Compositional features from ROOST and encoded MatMiner (ℓ-MM) models capture broader trends, showing that perovskites made of heavier, chemically diverse elements tend to be less stable and illustrating the balance between destabilizing wide-band-gap elements and the stabilizing effect of species with many unfilled d-states. 
Finally, encoded OFM (ℓ-OFM) features provide a granular view of bonding, distinguishing between the stabilizing interactions characteristic of oxides (e.g., s\u003csup\u003e2\u003c/sup\u003e\u0026thinsp;\u0026minus;\u0026thinsp;p\u003csup\u003e4\u003c/sup\u003e) and weaker bonds involving pnictogens. Collectively, this demonstrates that the model learns a multi-faceted and physically grounded understanding of perovskite stability. A full breakdown of the individual features shown in the figure is provided in the \u003cem\u003eSupplementary Information\u003c/em\u003e, section S10.\u003c/p\u003e\u003cp\u003eA comparative SHAP analysis of the best-performing MODNet@MV and MODNet@MV\u0026thinsp;+\u0026thinsp;Adj(coGN) models, which incorporate richer ORB and coGN features (see SI, section S11, Figs. S4-S7), showed that while the MODNet@MV(no ORB) model primarily relies on fundamental chemical descriptors, the addition of ORB features shifts the emphasis of the model toward geometric information such as packing efficiency. The top-performing MODNet@MV\u0026thinsp;+\u0026thinsp;Adj(coGN) model builds on this by capturing the most sophisticated features, representing a complex interplay between chemical and geometric properties. This increase in predictive power is accompanied by a decrease in direct interpretability. As the models become more complex, the ability to approximate their most important features with simple SISSO formulas diminishes (indicated by progressively lower R\u0026sup2; values), and their correlation with classical descriptors weakens (Table S13). 
This progression highlights the gap between the complex features of high-performing GNNs and the limited descriptive power of interpretable descriptors, emphasizing the need for more flexible descriptors that remain compact for symbolic regression methods and interpretability.\u003c/p\u003e\u003cp\u003eTo test the utility of our GNN feature approximations, we conducted a two-stage experiment. In the first stage, we compared two types of SISSO models: a baseline using only MatMiner and OFM descriptors, and an enhanced version that added formulas approximating the GNN's most important features. For both model types, we apply a consistent methodology, utilizing several primary feature pre-selection algorithms\u0026mdash;including mRMR (i-SISSO)\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e, random forest importances (rf-SISSO)\u003csup\u003e\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e, and our xgb-rfe-SISSO (SI, Sec. S8). The addition of the GNN-derived features yields a significant and consistent reduction in prediction error, as shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e(a). Our approach is analogous to hierarchical SISSO (hiSISSO)\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e, but it uniquely feeds back approximations of learned GNN features rather than terms from a prior SISSO model. In the second stage, we extract the terms from this enhanced SISSO model and incorporate them as new \"hiSISSO features\" to augment the MODNet@MV\u0026thinsp;+\u0026thinsp;adj(coGN) model. This augmented model further reduces the error to 0.0288 eV/unit cell. 
The t-SNE projection of SHAP value contributions and average feature importance of the classes in Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e(b,c) confirm their effectiveness, showing the high per-feature predictive power of hiSISSO features complementing the model. This demonstrates that explicit, interpretable formulas can improve generalization and raises the compelling question of whether GNN features could be replaced entirely if more expressive, physically grounded descriptors were available.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eIn this work, we introduced MatterVial, a unified and modular hybrid framework designed to bridge the gap between the predictive power of graph neural networks (GNNs) and the chemical transparency of traditional feature-based models in materials science. By augmenting the state-of-the-art feature-based model MODNet with a diverse and synergistic set of descriptors, this approach elevates its performance to be competitive with, and in several cases superior to, end-to-end GNNs. To summarize our contributions:\u003c/p\u003e\n\u003cp\u003e\u003cspan\u003e\u003c/span\u003e\u003c/p\u003e\n\u003cp\u003e(i) MatterVial is a novel open-source Python framework that generates a rich hybrid feature set. It integrates latent-space representations from various pretrained models, including structure-based GNNs such as MEGNet, an equivariant interatomic potential (ORB), and composition-based networks such as ROOST. The framework also uses computationally efficient GNN-approximated descriptors (ℓ-MM, ℓ-OFM) and features derived from symbolic regression.\u003c/p\u003e\u003cspan\u003e\n \u003cp\u003e(ii) The hybrid model demonstrates broad applicability and superior performance across the full MatBench v0.1 benchmark. 
It consistently reduces prediction errors across nearly all 13 tasks and establishes new state-of-the-art records for feature-based models in several categories.\u003c/p\u003e\n\u003c/span\u003e\u003cspan\u003e\n \u003cp\u003e(iii) A key innovation is a method that systematically decodes abstract GNN-derived features into more intuitive formulaic descriptors. This is achieved using surrogate models and symbolic regression to translate latent representations into explicit mathematical expressions based on fundamental physicochemical properties.\u003c/p\u003e\n\u003c/span\u003e\u003cspan\u003e\n \u003cp\u003e(iv) By incorporating features from an adjacent, task-specific GNN model, the framework enables a feature-based model to achieve predictive accuracy that is highly competitive with state-of-the-art GNNs while uniquely maintaining a modular and analyzable feature space.\u003c/p\u003e\n\u003c/span\u003e\u003cspan\u003e\n \u003cp\u003e(v) It was demonstrated that the interpretable formulas extracted from GNNs can be fed back into the model as new \u0026quot;hiSISSO features\u0026quot;, leading to a further reduction in prediction error. This confirms that the interpretability method can capture causally relevant physical information.\u003c/p\u003e\n\u003c/span\u003e\n\u003cp\u003e\u003c/p\u003e\n\u003cp\u003eIn conclusion, this work repositions feature-based modeling as a premier methodology in materials informatics. It delivers a practical solution that meets the dual demands of high accuracy and interpretability, a combination that is becoming increasingly critical in the field. While predictive accuracy is essential, interpretability allows researchers to validate that models have learned physically meaningful principles, thereby building trust and moving beyond simple prediction to genuine scientific understanding. This deeper insight accelerates materials discovery by enabling a shift from brute-force screening to more targeted, hypothesis-driven design. 
Ultimately, this alignment with the principles of explainable AI is a prerequisite for developing the next generation of autonomous discovery platforms, or \u0026ldquo;self-driving labs\u0026rdquo;, which require models that can not only predict outcomes but also explain the underlying principles to guide subsequent experiments.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\u003ch2\u003eMODNet model training\u003c/h2\u003e\u003cp\u003eThe MatMiner featurizer used throughout this work is described in detail in the \u003cem\u003eSupplementary Information\u003c/em\u003e, section S1. For all experiments incorporating MatterVial features, because the combined feature set is large, we first preselect features using recursive feature elimination with XGBoost\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e to reduce the pool to 800 features. Subsequently, the built-in MODNet feature selection algorithm is used to select and rank the subset of these features that will be used for training. At this point, we can determine which groups of MatterVial features are relevant for a given task (\u0026ldquo;Best MatterVial groups\u0026rdquo; in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). The MODNet hyperparameters are optimized via a genetic algorithm, and the best-performing models on the validation set form deep ensembles, as described in Ref. 26, which are then evaluated on the test set to obtain the final metrics.\u003c/p\u003e\u003cp\u003eThe mean absolute error (MAE) serves as the primary evaluation metric in regression tasks, and for classification tasks, the area under the receiver operating characteristic curve (AUROC) is used. 
We consistently use five-fold cross-validation, as described in MatBench\u003csup\u003e\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e\u003c/sup\u003e, in all presented tasks. A supplementary data repository with detailed results of our work is available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/rogeriog/MatterVial_SupportData\u003c/span\u003e\u003cspan address=\"https://github.com/rogeriog/MatterVial_SupportData\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eMatterVial implementation\u003c/h3\u003e\n\u003cp\u003e\u003cb\u003eMatterVial\u003c/b\u003e is an open-source featurizer tool implemented in Python (available at \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/rogeriog/MatterVial\u003c/span\u003e\u003cspan address=\"https://github.com/rogeriog/MatterVial\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e) to enhance material property predictions by integrating pretrained descriptor-oriented and task-oriented GNNs, as well as precomputed symbolic formulas from traditional chemically intuitive descriptors. The package offers significant flexibility and modularity, allowing the extraction of features from different layers of pretrained models and the incorporation of other GNN models as needed. The following outlines each MatterVial featurizer employed:\u003c/p\u003e\u003cp\u003e● \u003cb\u003eℓ-OFM featurizer\u003c/b\u003e: the OFM featurizer captures valence electron interactions at each atomic site by employing a weighted vector outer product of one-hot encoded valence orbitals for every atom (details in the \u003cem\u003eSupplementary Information\u003c/em\u003e, section S2, Fig. \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). 
The structural representation is achieved by averaging all local OFMs. We apply the OFM featurizer to MP2018-stable, a subset of the Materials Project MP-crystals-2018.6.1\u003csup\u003e50\u003c/sup\u003e dataset comprising 106,113 structures whose energy above the convex hull is lower than 150 meV, and then train an autoencoder to derive a latent-space representation. The latent OFM features are subsequently used as targets to train a GNN model that generates these features directly from the initial structures.\u003c/p\u003e\u003cp\u003e● \u003cb\u003eℓ-MM featurizer\u003c/b\u003e: following a similar procedure to the OFM featurizer, we encode features obtained from the default MatMiner featurizer of MODNet v.0.1.13 applied to the MP2018-stable dataset, resulting in 1,336 MatMiner features. The selected compression level provides latent MatMiner features (ℓ-MM), which are then used as targets to train a GNN model that directly generates these features from the original structures.\u003c/p\u003e\u003cp\u003eThe \u003cb\u003eDescriptorMEGNetFeaturizer\u003c/b\u003e class in the MatterVial package retrieves the OFM-encoded and MatMiner-encoded features. A thorough investigation of these encoded features, including the use of different compression levels and hyperparameters, was conducted, as detailed in the \u003cem\u003eSupplementary Information\u003c/em\u003e (sections S4, S5, S7 and Fig. S3, also Tables S1\u0026ndash;S3, S5\u0026ndash;S8, S10\u0026ndash;S11).\u003c/p\u003e\u003cp\u003e● \u003cb\u003eMVL MatterVial featurizers\u003c/b\u003e: Utilizing the \u003cb\u003eMVLFeaturizer\u003c/b\u003e class from the MatterVial package, we incorporate five pretrained MEGNet models provided by the Materials Virtual Lab\u003csup\u003e\u003cspan citationid=\"CR50\" class=\"CitationRef\"\u003e50\u003c/span\u003e\u003c/sup\u003e. 
Specifically, these are the models trained for the formation energy, Fermi energy, and elastic constants K\u003csup\u003eVRH\u003c/sup\u003e and G\u003csup\u003eVRH\u003c/sup\u003e on the 2019.4.1 Materials Project crystals dataset, as well as the band gap regression model trained on the 2018.6.1 Materials Project crystals dataset. The default MEGNet architecture comprises MEGNet blocks followed by an MLP with two dense layers, one with 32 neurons and the other with 16 neurons, before producing the target property (see section S3, Fig. S2, in \u003cem\u003eSupplementary Information\u003c/em\u003e). The modularity of the MatterVial package allows us to extract features from different layers of these pretrained models. We extract features from the MLP layers preceding the output, specifically from the 32-neuron (layer32) and 16-neuron (layer16) configurations. An investigation was conducted on the effect of using the different layers for prediction as provided in \u003cem\u003eSupplementary Information\u003c/em\u003e, section S6, Table S4. For this paper, the extracted features of both layers (160 descriptors for layer32 and 80 descriptors for layer16) are concatenated and added to the final feature vector.\u003c/p\u003e\u003cp\u003e● \u003cb\u003eAdjacent GNN featurizer\u003c/b\u003e: The \u003cb\u003eAdjacentGNNFeaturizer\u003c/b\u003e class from the MatterVial package is employed to train a MEGNet or coGN model on the fly for each fold of the train-test split. This adjacent model captures task-specific data nuances, enhancing prediction accuracy. 
The default hyperparameters from MEGNet v.1.3.2 and coGN are utilized, as detailed in the \u003cem\u003eSupplementary Information\u003c/em\u003e, section S7.2, Table S9.\u003c/p\u003e\u003cp\u003e● \u003cb\u003eSISSO-based formula featurizer\u003c/b\u003e: The SISSO++ framework\u003csup\u003e51\u003c/sup\u003e was used to generate symbolic expressions that approximate target material properties across 15 datasets (see \u003cem\u003eSupplementary Information\u003c/em\u003e, section S8, Table S12 for details) by transforming MatMiner features. The method begins by recursively applying a predefined set of operators (e.g., addition, subtraction, multiplication, division, sine, cosine, exponential, and logarithm) to expand the feature space, followed by sure-independence screening (SIS) that ranks the resulting candidates by their correlation with the target property and a sparsification step that selects a compact descriptor set. For our configuration, restricted to rung one, this yields 20 paired-feature formulas. By opting for the expressions produced at the SIS step instead of the final SISSO formula, versatility and generalization are assured when integrated with MODNet neural networks. These formulas, derived for each of the 15 tasks, are compiled in the file SISSO_FORMULAS_v1.txt, which is accessed by the \u003cb\u003eget_sisso_features\u003c/b\u003e function in MatterVial to process the given MatMiner features (either directly or decoded from ℓ-MM) and output a dataframe of evaluated expressions.\u003c/p\u003e\u003cp\u003eIn terms of computational cost, generating the complete feature set with MatterVial is substantially more efficient than traditional MatMiner featurization. 
Although the precise runtime for MatMiner is highly dataset-dependent, our observations indicate that MatterVial reduces feature generation time by at least two orders of magnitude, especially when leveraging GPUs.\u003c/p\u003e\n\u003ch3\u003eRetrieving interpretability via MatterVial’s interpreter module\u003c/h3\u003e\n\u003cp\u003eBefore employing MatterVial\u0026rsquo;s interpreter module, we conduct a SHAP value analysis (see \u003cem\u003eSupplementary Information\u003c/em\u003e, section S9) on our MODNet models to assess feature importance. Using the SHAP Python library, we perform the analysis with 300 samples and 500 perturbations on 24 CPU cores in about 20 minutes, revealing the features with the greatest impact on the model predictions.\u003c/p\u003e\u003cp\u003eTo bridge the gap between high-level latent representations and interpretable chemical descriptors, MatterVial leverages surrogate XGBoost models. These models are trained to predict each latent feature, based on the previous assessment, using the MP2018-stable dataset featurized with interpretable MatMiner and OFM features. The tree-based additive structure of XGBoost ensures rapid and parallel training as well as efficient SHAP calculations. For each feature, the top 30 most influential interpretable descriptors, as determined by SHAP, are forwarded to SISSO++, which performs symbolic regression to retrieve a symbolic formula that correlates more closely with the latent feature. This process is illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e. 
The SHAP decompositions and SISSO formula for the latent-space features computed this way can be retrieved by calling the \u003cb\u003eInterpreter\u003c/b\u003e class and invoking \u003cb\u003eget_shap_values\u003c/b\u003e or \u003cb\u003eget_formula\u003c/b\u003e with the feature name as generated by MatterVial.\u003c/p\u003e\u003cp\u003eMoreover, this interpretability framework extends to adjacent GNN models. Using MatterVial\u0026rsquo;s tools, including the AdjacentGNNFeaturizer, a task-specific GNN model is trained and its latent features can be interpreted following the previous pipeline for which helper functions are provided. In this way, both the pretrained models imported by MatterVial and the adjacent models trained on the fly benefit from enhanced transparency, enabling users to decode the underlying chemical principles driving the predictions.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe acknowledge the supercomputing facilities of the Universit\u0026eacute; catholique de Louvain (CISM/UCL) and the Consortium des \u0026Eacute;quipements de Calcul Intensif en F\u0026eacute;d\u0026eacute;ration Wallonie Bruxelles (C\u0026Eacute;CI) for computational resources. This work also made use of Lucia, the Tier-1 supercomputer of the Walloon Region.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was financed in part by the Coordena\u0026ccedil;\u0026atilde;o de Aperfei\u0026ccedil;oamento de Pessoal de N\u0026iacute;vel Superior \u0026ndash; Brazil (CAPES) \u0026ndash; Finance Code 001 and Universit\u0026eacute; catholique de Louvain (Ref. No. ARH/MKK/01155839).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eR.G., P.D.B., and G.M.R. conceptualized the study. R.G. 
performed the experiments and data analysis, and wrote the main manuscript. P.D.B. and G.M.R. supervised the investigation. R.G., P.D.B., T.P., G.M.R. and M.J.L.S. reviewed and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eA supplementary data repository with detailed results of our work is available at https://github.com/rogeriog/MatterVial_SupportData.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eRodrigues, J. F., Florea, L., De Oliveira, M. C. F., Diamond, D. \u0026amp; Oliveira, O. N. Big data and machine learning for materials science. \u003cem\u003eDiscov Mater\u003c/em\u003e \u003cstrong\u003e1\u003c/strong\u003e, 12 (2021).\u003c/li\u003e\n\u003cli\u003eDey, A. \u003cem\u003eet al.\u003c/em\u003e State of the Art and Prospects for Halide Perovskite Nanocrystals. \u003cem\u003eACS Nano\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 10775\u0026ndash;10981 (2021).\u003c/li\u003e\n\u003cli\u003eGuo, K., Yang, Z., Yu, C.-H. \u0026amp; Buehler, M. J. Artificial intelligence and machine learning in design of mechanical materials. \u003cem\u003eMater. Horiz.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 1153\u0026ndash;1172 (2021).\u003c/li\u003e\n\u003cli\u003eDe Breuck, P.-P., Hautier, G. \u0026amp; Rignanese, G.-M. Materials property prediction for limited datasets enabled by feature selection and joint learning with MODNet. \u003cem\u003enpj Computational Materials\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 1\u0026ndash;8 (2021).\u003c/li\u003e\n\u003cli\u003eZhang, B., Zhou, M., Wu, J. \u0026amp; Gao, F. Predicting the Materials Properties Using a 3D Graph Neural Network With Invariant Representation. 
\u003cem\u003eIEEE Access\u003c/em\u003e \u003cstrong\u003e10\u003c/strong\u003e, 62440\u0026ndash;62449 (2022).\u003c/li\u003e\n\u003cli\u003eTawfik, S. A. \u0026amp; Russo, S. P. Naturally-meaningful and efficient descriptors: machine learning of material properties based on robust one-shot ab initio descriptors. \u003cem\u003eJ Cheminform\u003c/em\u003e \u003cstrong\u003e14\u003c/strong\u003e, 78 (2022).\u003c/li\u003e\n\u003cli\u003eWard, L. \u003cem\u003eet al.\u003c/em\u003e Matminer: An open source toolkit for materials data mining. \u003cem\u003eComputational Materials Science\u003c/em\u003e \u003cstrong\u003e152\u003c/strong\u003e, 60\u0026ndash;69 (2018).\u003c/li\u003e\n\u003cli\u003ePretto, T., Baum, F., Gouv\u0026ecirc;a, R. A., Brolo, A. G. \u0026amp; Santos, M. J. L. Optimizing the Synthesis Parameters of Double Perovskites with Machine Learning Using a Multioutput Regression Model. \u003cem\u003eJ. Phys. Chem. C\u003c/em\u003e \u003cstrong\u003e128\u003c/strong\u003e, 7041\u0026ndash;7052 (2024).\u003c/li\u003e\n\u003cli\u003eLiu, J. \u003cem\u003eet al.\u003c/em\u003e Toward Excellence of Electrocatalyst Design by Emerging Descriptor‐Oriented Machine Learning. \u003cem\u003eAdv Funct Materials\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, 2110748 (2022).\u003c/li\u003e\n\u003cli\u003eKim, G.-H., Lee, C., Kim, K. \u0026amp; Ko, D.-H. Novel structural feature-descriptor platform for machine learning to accelerate the development of organic photovoltaics. \u003cem\u003eNano Energy\u003c/em\u003e \u003cstrong\u003e106\u003c/strong\u003e, 108108 (2023).\u003c/li\u003e\n\u003cli\u003eLi, S. \u003cem\u003eet al.\u003c/em\u003e Encoding the atomic structure for machine learning in materials science. \u003cem\u003eWIREs Comput Mol Sci\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, e1558 (2022).\u003c/li\u003e\n\u003cli\u003eDunn, A. MatBench Leaderboard. 
https://matbench.materialsproject.org/ (2024).\u003c/li\u003e\n\u003cli\u003eRiebesell, J. \u003cem\u003eet al.\u003c/em\u003e A framework to evaluate machine learning crystal stability predictions. \u003cem\u003eNat Mach Intell\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 836\u0026ndash;847 (2025).\u003c/li\u003e\n\u003cli\u003eLam Pham, T. \u003cem\u003eet al.\u003c/em\u003e Machine learning reveals orbital interaction in materials. \u003cem\u003eScience and Technology of Advanced Materials\u003c/em\u003e \u003cstrong\u003e18\u003c/strong\u003e, 756\u0026ndash;765 (2017).\u003c/li\u003e\n\u003cli\u003eBart\u0026oacute;k, A. P., Kondor, R. \u0026amp; Cs\u0026aacute;nyi, G. On representing chemical environments. \u003cem\u003ePhys. Rev. B\u003c/em\u003e \u003cstrong\u003e87\u003c/strong\u003e, 184115 (2013).\u003c/li\u003e\n\u003cli\u003eKanter, J. M. \u0026amp; Veeramachaneni, K. Deep feature synthesis: Towards automating data science endeavors. in \u003cem\u003e2015 IEEE International Conference on Data Science and Advanced Analytics (DSAA)\u003c/em\u003e 1\u0026ndash;10 (IEEE, Campus des Cordeliers, Paris, France, 2015). doi:10.1109/DSAA.2015.7344858.\u003c/li\u003e\n\u003cli\u003eArik, S. O. \u0026amp; Pfister, T. TabNet: Attentive Interpretable Tabular Learning. Preprint at https://doi.org/10.48550/arXiv.1908.07442 (2020).\u003c/li\u003e\n\u003cli\u003eRuff, R., Reiser, P., St\u0026uuml;hmer, J. \u0026amp; Friederich, P. Connectivity Optimized Nested Graph Networks for Crystal Structures. Preprint at http://arxiv.org/abs/2302.14102 (2023).\u003c/li\u003e\n\u003cli\u003eKo, T. W. \u003cem\u003eet al.\u003c/em\u003e Materials Graph Library (MatGL), an open-source graph deep learning library for materials science and chemistry. Preprint at https://doi.org/10.48550/arXiv.2503.03837 (2025).\u003c/li\u003e\n\u003cli\u003eGoodall, R. E. A. \u0026amp; Lee, A. A. 
Predicting materials properties without crystal structure: deep representation learning from stoichiometry. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, (2020).\u003c/li\u003e\n\u003cli\u003eRhodes, B. \u003cem\u003eet al.\u003c/em\u003e Orb-v3: atomistic simulation at scale. Preprint at https://doi.org/10.48550/arXiv.2504.06231 (2025).\u003c/li\u003e\n\u003cli\u003eShiota, T., Ishihara, K. \u0026amp; Mizukami, W. Universal neural network potentials as descriptors: towards scalable chemical property prediction using quantum and classical computers. \u003cem\u003eDigital Discovery\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 1714\u0026ndash;1728 (2024).\u003c/li\u003e\n\u003cli\u003eEl-Samman, A. M. \u003cem\u003eet al.\u003c/em\u003e Global geometry of chemical graph neural network representations in terms of chemical moieties. \u003cem\u003eDigital Discovery\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 544\u0026ndash;557 (2024).\u003c/li\u003e\n\u003cli\u003eEl-Samman, A. M., De Castro, S., Morton, B. \u0026amp; De Baerdemacker, S. Transfer learning graph representations of molecules for pKa,\u003csup\u003e13\u003c/sup\u003e C-NMR, and solubility. \u003cem\u003eCan. J. Chem.\u003c/em\u003e \u003cstrong\u003e102\u003c/strong\u003e, 275\u0026ndash;288 (2024).\u003c/li\u003e\n\u003cli\u003eElijo\u0026scaron;ius, R. \u003cem\u003eet al.\u003c/em\u003e Zero shot molecular generation via similarity kernels. \u003cem\u003eNat Commun\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 5991 (2025).\u003c/li\u003e\n\u003cli\u003eKim, S. Y., Park, Y. J. \u0026amp; Li, J. Leveraging neural network interatomic potentials for a foundation model of chemistry. Preprint at https://doi.org/10.48550/arXiv.2506.18497 (2025).\u003c/li\u003e\n\u003cli\u003eOviedo, F., Ferres, J. L., Buonassisi, T. \u0026amp; Butler, K. T. Interpretable and Explainable Machine Learning for Materials Science and Chemistry. \u003cem\u003eAcc. 

License: CC-BY-4.0