Predictive Bioactivity Modeling and Structural Binding Analysis for the Identification of Potential SMYD3 Modulators

doi:10.21203/rs.3.rs-8662415/v1


2026 · doi:10.21203/rs.3.rs-8662415/v1


Research Article

Abdullah R. Alzahrani, Zia Ur Rehman, Talha Jawaid, Abida Khan

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8662415/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 10 Apr, 2026 in Molecular Diversity.
Version 1 of the preprint.

Abstract

SMYD3 is a lysine methyltransferase involved in epigenetic regulation and oncogenic transcription, making it an attractive yet challenging therapeutic target.
This study presents an integrated computational workflow combining machine learning based quantitative structure-activity relationship (QSAR) modelling, external bioactivity prediction, molecular docking, molecular dynamics (MD) simulations, and network analysis to prioritize potential SMYD3 inhibitors. ML-QSAR models were constructed using multiple molecular descriptor representations and regression algorithms. A MACCS fingerprint-based Random Forest model showed the most reliable external predictivity, supported by cross-validation, applicability domain assessment, and Y-randomization analysis. Feature interpretability using SHAP highlighted a small set of chemically meaningful structural patterns that consistently influenced activity prediction. The validated model was then applied to an external compound library, and bioactivity was predicted only for compounds lying within the defined applicability domain. This screening enabled the prioritization of in-domain candidates with moderate predicted potency and acceptable structural coverage relative to the training space. Structure-based evaluation using the crystallographic SMYD3 structure demonstrated that selected compounds bind within the experimentally validated active site and engage key residues observed in the co-crystal complex. Extended 250 ns MD simulations indicated that CHEMBL4472528 maintained stable binding, persistent polar and hydrophobic interactions, and favorable binding free energies compared with both the co-crystal ligand and other screened candidates. Network and pathway analysis further placed SMYD3 within a focused chromatin-associated and transcriptional regulatory context, supporting the biological relevance of the target. This work provides a reproducible computational framework for SMYD3 inhibitor prioritization and highlights CHEMBL4472528 as a promising scaffold for further investigation. 
Keywords: SMYD3, Cancer, Machine learning, QSAR, Prediction model, Molecular Modelling, Network Biology

Introduction

Epigenetic regulation plays a central role in shaping transcriptional programs that govern cellular identity, metabolism, and disease progression. Among epigenetic writers, protein lysine methyltransferases influence chromatin organization through site-specific histone methylation, thereby modulating transcriptional output without altering DNA sequence [1, 2]. Dysregulation of these enzymes has been widely implicated in cancer, where altered histone methylation supports oncogenic transcriptional states and adaptive cellular phenotypes [2, 3].

SMYD3 (SET and MYND domain-containing protein 3) has emerged as a context-dependent histone methyltransferase that primarily regulates transcriptional activity rather than acting as a global chromatin modifier [4, 5]. Accumulating evidence indicates that SMYD3 contributes to disease-associated transcriptional amplification by modulating oncogenic signaling pathways, metabolic reprogramming, and stress-responsive gene expression programs. Unlike broadly acting epigenetic enzymes, SMYD3 appears to exert its effects in a pathway-selective manner, which makes it an attractive but still incompletely explored target for chemical modulation [5–7].

Structural and biochemical studies have revealed that SMYD3 has a well-defined catalytic pocket that can accommodate small molecules that interfere with substrate recognition and methyltransferase activity [8]. Several SMYD3 inhibitors have been identified via experimental screening and structure-guided medicinal chemistry efforts, and curated inhibitory bioactivity data are now publicly available [4, 9].
However, these data have not yet been exploited systematically through ligand-based quantitative analysis of activity to identify the chemical features that control inhibitory potency across diverse scaffolds.

In parallel, SMYD3 has been positioned within a compact epigenetic and transcriptional regulatory network involving histone H3 variants, chromatin-associated methyltransferases, and transcription-related cofactors [8, 10, 11]. Pathway and disease association analyses consistently link SMYD3 to cancer-related phenotypes, reinforcing its relevance in oncogenic transcriptional control [12]. Chemical modulation of SMYD3 is therefore expected to interfere with substrate recognition and methyltransferase activity, providing a rational basis for targeting its transcriptional regulatory function. Despite this growing biological understanding, integrative studies connecting chemical determinants of SMYD3 inhibition with systems-level biological context remain limited. This gap constrains the rational prioritisation of candidate modulators and hinders the interpretation of how chemical inhibition may translate into biologically relevant effects.

The present study addresses this gap by applying a quantitative activity analysis framework to characterise and prioritise small-molecule SMYD3 modulators using curated inhibitory bioactivity data. Predictive models were developed to identify key molecular features associated with SMYD3 inhibition and were applied to external compounds within a defined applicability domain. Structure-guided analysis and molecular dynamics simulations based on an experimentally validated SMYD3 structure were then used to assess binding stability and mechanistic plausibility. Finally, network and pathway analyses were employed to place SMYD3 targeting within its biological context. This whole-systems approach offers a clear, mechanistically grounded perspective on potential SMYD3 modulators while remaining within the limits of computational inference.

Methods and Materials

Data collection and preprocessing

SMYD3 bioactivity data were retrieved from the ChEMBL database (target ID: CHEMBL2321643; UniProt accession: Q9H7B4). Only small-molecule inhibitors with experimentally reported IC₅₀ values (in nanomolar) were considered. Starting from 1,873 records, all activity values were transformed into pIC₅₀ to facilitate regression modelling (Table S1). During curation, duplicates, inconsistent measurements, and records with missing or unclear activity data were removed. This resulted in a final set of 1,483 unique compounds for QSAR analysis (Table S2). For descriptive purposes, compounds were grouped into activity classes based on commonly used pIC₅₀ thresholds. Molecules with pIC₅₀ values of 7.0 or higher were classified as active, those with pIC₅₀ values below 6.0 were labelled inactive, and compounds falling between these thresholds were considered intermediate (Table S3).

Molecular representation and descriptor preprocessing

Molecular descriptors were generated to represent chemical structures in a machine-learning-compatible format. Fingerprints and descriptor matrices were computed using a PaDEL-based workflow [13], covering four descriptor spaces: MACCS keys, PubChem fingerprints, CDK descriptors, and Substructure fingerprints. Each descriptor set was produced using identical compound ordering to ensure that the same molecules were consistently aligned across all representations. To improve model stability and reduce noise from non-informative variables, a low-variance feature filtering step was applied independently to each descriptor set. After variance-based removal, the final feature counts were: 62 features for MACCS, 120 features for PubChem, 12 features for Substructure, and 504 features for CDK. These reduced descriptor matrices were used in downstream QSAR modelling and comparative evaluation.
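
As an illustration of the curation step described under Data collection and preprocessing, the pIC₅₀ transformation and the descriptive activity classes can be sketched as follows. This is a minimal sketch, not the authors' script: the function names are hypothetical, the conversion assumes IC₅₀ values expressed in nanomolar (so pIC₅₀ = 9 − log₁₀ IC₅₀), and the example concentrations are invented for demonstration.

```python
import math


def pic50_from_ic50_nm(ic50_nm: float) -> float:
    """Convert an IC50 given in nanomolar to pIC50 (= -log10 of IC50 in molar)."""
    if ic50_nm <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 nM = 1e-9 M, so -log10(ic50_nm * 1e-9) = 9 - log10(ic50_nm)
    return 9.0 - math.log10(ic50_nm)


def activity_class(pic50: float) -> str:
    """Apply the descriptive thresholds quoted in the text."""
    if pic50 >= 7.0:
        return "active"
    if pic50 < 6.0:
        return "inactive"
    return "intermediate"


if __name__ == "__main__":
    # Invented example concentrations (nM), not values from the dataset.
    for ic50 in (10.0, 500.0, 20000.0):
        p = pic50_from_ic50_nm(ic50)
        print(f"{ic50:>9.1f} nM -> pIC50 {p:.2f} ({activity_class(p)})")
```

The classes are used only descriptively in the text; the QSAR models themselves are regressions on the continuous pIC₅₀ values.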

QSAR model development and validation

QSAR models were developed using a supervised regression framework to predict pIC₅₀ values based on multiple molecular descriptor sets, including MACCS, PubChem, CDK, and Substructure fingerprints. The curated dataset was divided using a two-step random splitting strategy, with 70% of compounds assigned to the training set and the remaining 30% further split equally into validation (15%) and independent test (15%) sets. All splits were performed with random shuffling and fixed seeds to ensure reproducibility. Several machine learning algorithms were evaluated across descriptor spaces, including Random Forest, Extra Trees, Gradient Boosting Regressor, Histogram-based Gradient Boosting Regressor, Ridge Regression with scaling, and Support Vector Regression. Tree-based methods were primarily employed to capture non-linear structure-activity relationships, while linear and kernel-based models were included for comparison. Descriptor matrices were converted to numeric format, and missing values were handled appropriately before model fitting. Model performance was evaluated on the training, validation, and test datasets using standard regression metrics. To assess generalisation to new data, external predictive power was measured using Q²ext and the Tropsha criteria (Q²F1, Q²F2, and Q²F3). Predictions from each descriptor-model combination were retained together with complete compound metadata, so that applicability domain analysis and virtual screening could proceed with full context.

Model selection and robustness analysis

QSAR models were compared systematically to identify the best descriptor-algorithm combinations. Model selection prioritised test-set performance, with models ranked by test RMSE and CCC.
This approach favours a final model that combines low prediction error with strong agreement between observed and predicted activities. In addition to overall ranking, the best-performing model within each descriptor space was identified to enable descriptor-wise comparison. Model robustness was assessed using five-fold cross-validation on the training data, with performance reported as mean and standard deviation values to evaluate stability across different data partitions. The applicability domain of the selected models was defined using a leverage-based approach and visualised through Williams plots, allowing identification of in-domain predictions, high-leverage compounds, and response outliers. To confirm that model performance was not driven by chance correlations, Y-randomization tests were performed by permuting the response variable and rebuilding models under identical conditions. Model interpretability was examined using SHAP analysis to quantify the contribution of individual descriptors to predicted pIC₅₀ values [14]. For downstream virtual screening, predicted compounds were filtered based on applicability domain criteria, and chemical diversity among shortlisted hits was assessed using Tanimoto similarity analysis.

Prediction of bioactivity for external compounds

To expand the chemical space around the identified QSAR hits, additional candidate molecules were retrieved using scaffold- and similarity-based searches. After removing invalid structures, these molecules were retained as an external screening set. A total of 1,332 structurally related compounds were obtained from Chemspace [15], along with 29 compounds obtained from the ZINC database [16] (Table S4). All external compounds were encoded using MACCS fingerprints generated with the same protocol applied during QSAR model development.
The final Random Forest QSAR model, trained on the combined training and validation data, was then used to predict pIC₅₀ values for the external library. Feature alignment was maintained by retaining only those MACCS descriptors present in the trained model. The reliability of predictions was assessed using a leverage-based applicability domain approach. Leverage values were calculated for each external compound, and only predictions falling within the defined applicability domain were considered reliable. In-domain compounds with high predicted activity were prioritised and shortlisted for subsequent structure-based analyses.

Structure-Based Molecular Docking and Molecular Dynamics Simulations

The crystal structure of human SMYD3 was obtained from the Protein Data Bank (PDB ID: 6P6G) [17], a high-resolution co-crystal complex containing an isoxazole amide inhibitor bound to the catalytic site. This structure was selected based on its experimental quality, lack of mutations, and well-defined binding pocket. Protein preparation was performed using the Maestro Protein Preparation Wizard [18], including correction of bond orders, assignment of protonation states, addition of missing hydrogens, and restrained energy minimization. Ligand structures, including the QSAR-derived hit, the co-crystallized reference inhibitor, and the selected externally predicted compound, were prepared using LigPrep. Relevant ionization states at physiological pH were generated, and geometries were optimized before docking. Molecular docking was carried out using Glide in extra-precision mode [19], with the receptor grid centered on the co-crystallized ligand. Docked poses were ranked using Glide XP scores and interaction patterns, and the most favorable binding conformations were selected for further analysis. Molecular dynamics simulations were performed using the Desmond package [20].
Protein–ligand complexes were parameterized with the OPLS4 force field [21] and solvated in an explicit TIP3P water model using an orthorhombic box with a 10 Å buffer. Systems were neutralized and adjusted to 0.15 M NaCl. Following energy minimization, equilibration was carried out under NVT and NPT ensembles using standard Desmond protocols. Temperature (300 K) and pressure (1 atm) were maintained. Production simulations were conducted for 250 ns [22]. Structural stability was assessed using protein and ligand RMSD, while residue flexibility was evaluated using RMSF. Protein–ligand interactions were monitored throughout the trajectory. Relative binding free energies were estimated using the MM-GBSA approach on representative frames. Essential dynamic behavior was explored using principal component analysis, and correlated residue motions were examined through dynamic cross-correlation matrix analysis.

Network and Pathway context analysis

A protein-protein interaction network centred on SMYD3 was constructed using the STRING database [23] under Homo sapiens. Interactions were limited to first-shell partners and filtered using a confidence score to retain experimentally supported and curated associations. This restriction avoided excessive network expansion and maintained biological interpretability. To complement the interaction network, gene-disease and gene-pathway associations related to SMYD3 were retrieved from the Comparative Toxicogenomics Database (CTD) [24]. Functional enrichment analysis was performed using Gene Ontology biological processes and pathway annotations provided by STRING. Enrichment results were screened to retain statistically significant terms related to chromatin regulation, transcriptional control, and epigenetic processes.

Results

QSAR Model Development and Performance Evaluation

QSAR models were developed using four descriptor representations.
These were MACCS fingerprints, PubChem fingerprints, CDK descriptors, and Substructure fingerprints. Low-variance filtering was applied to reduce non-informative features. This resulted in 66 MACCS features, 124 PubChem features, 508 CDK features, and 16 Substructure features (Tables S5-S8). The dataset was split into training, validation, and test subsets using a 70:15:15 strategy. Model development and selection were guided by performance on the independent test set. Six regression algorithms were screened across the descriptor spaces: Random Forest, Extra Trees, Gradient Boosting, HistGradientBoosting, Ridge regression, and Support Vector Regression. Hyperparameter tuning was performed for the candidate models to improve generalisation. The final selection was based on external predictive performance. Test RMSE was used as the primary ranking metric, and test CCC was used to confirm agreement between predicted and experimental pIC₅₀ values. Test R² and MAE were reported as supporting indicators (Table S9).

The MACCS fingerprint-based Random Forest model yielded the best overall predictive performance among the evaluated descriptor-algorithm combinations. It achieved a test R² of 0.891 with a test RMSE of 0.447 and a test MAE of 0.257. The corresponding test CCC reached 0.941. This model also showed strong alignment between predicted and observed values, as reflected by a test Pearson correlation of 0.945 (Tables 1–2). These results supported the selection of MACCS Random Forest as the primary model for downstream analyses (Tables S10-S11).

The best PubChem model was obtained using HistGradientBoosting. It reached a test R² of 0.811 with a test RMSE of 0.588 and a test MAE of 0.426. The test CCC was 0.895, and the test Pearson correlation was 0.901. This performance indicated reasonable predictive capacity, but it remained consistently weaker than the MACCS-based model when ranked by RMSE and CCC (Table S10).
CDK descriptors produced lower generalisation performance in the current setting. The best CDK model was HistGradientBoosting with a test R² of 0.503 and a test RMSE of 0.971. The test MAE was 0.726. The test CCC was 0.669, and the Pearson correlation was 0.709. Substructure fingerprints showed limited predictive strength. HistGradientBoosting was also the best Substructure model, with a test R² of 0.575 and a test RMSE of 0.883. The test MAE was 0.671. The test CCC was 0.734, and the Pearson correlation was 0.761 (Tables 1–2). These results suggested that the MACCS fingerprint space captured the most informative structural patterns for activity prediction in this dataset (Table S10).

Model stability was assessed using multiple random splits. Random seeds ranging from 0 to 9 were evaluated to reduce the likelihood of chance performance driven by a single split. The MACCS models showed consistent behaviour across these splits and across algorithms. The best-performing configuration was obtained under one split, which was then used as the reference setting for interpretability, applicability domain analysis, external prediction, and structure-based validation. Comprehensive results for all descriptor-algorithm combinations and all random splits are provided in the Supplementary Information (Table S10). This includes training and validation metrics, external validation statistics, and additional Q² measures.

This selection strategy avoided reliance on a single algorithm or a single descriptor representation. It prioritised models that achieved low test error, strong agreement metrics, and stable performance across repeated splits. These considerations supported the use of the MACCS fingerprint-based Random Forest model for subsequent applicability domain analysis, SHAP-based interpretation, external compound screening, and molecular docking and molecular dynamics simulations.
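
The ranking metrics used above (RMSE as the primary criterion, with CCC and Pearson r as agreement checks) can be computed as in the sketch below. It is an illustrative stand-alone implementation using Lin's concordance correlation coefficient with population (1/n) moments; the function names and the toy values are assumptions, not the authors' code or data.

```python
import math


def _moments(x, y):
    """Means, population variances and covariance of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, vx, vy, cov


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def pearson_r(y_true, y_pred):
    """Pearson correlation: measures linear association only."""
    _, _, vx, vy, cov = _moments(y_true, y_pred)
    return cov / math.sqrt(vx * vy)


def ccc(y_true, y_pred):
    """Lin's concordance correlation: also penalises shifts away from the identity line."""
    mx, my, vx, vy, cov = _moments(y_true, y_pred)
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)


if __name__ == "__main__":
    observed = [5.0, 6.0, 7.0]   # toy pIC50 values
    predicted = [6.0, 7.0, 8.0]  # perfectly correlated but offset by one log unit
    print(rmse(observed, predicted), pearson_r(observed, predicted), ccc(observed, predicted))
```

In the toy case the predictions are perfectly correlated with the observations but systematically offset, so Pearson r stays at 1 while CCC drops. This is why CCC is a useful complement to correlation when checking agreement between predicted and experimental pIC₅₀ values.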

Table 1. Training and Validation Performance of QSAR Models

Descriptor    MACCS          PubChem               Substructure          CDK
Algorithm     Random Forest  HistGradientBoosting  HistGradientBoosting  HistGradientBoosting
Features      66             124                   16                    508
Train R²      0.968          0.873                 0.624                 0.837
Train RMSE    0.248          0.491                 0.845                 0.559
Train MAE     0.144          0.343                 0.638                 0.409
Val R²        0.815          0.788                 0.416                 0.334
Val RMSE      0.625          0.669                 1.111                 1.113
Val MAE       0.361          0.491                 0.872                 0.841

Table 2. External Test Set Performance and Model Agreement Metrics

Descriptor      MACCS          PubChem               Substructure          CDK
Algorithm       Random Forest  HistGradientBoosting  HistGradientBoosting  HistGradientBoosting
Test R²         0.891          0.811                 0.575                 0.503
Test RMSE       0.447          0.588                 0.883                 0.971
Test MAE        0.257          0.426                 0.671                 0.726
Test Pearson r  0.945          0.901                 0.761                 0.709
Test CCC        0.941          0.895                 0.734                 0.669

Note: Model selection prioritised external predictivity. The MACCS-RF model shows comparable validation and test performance, high CCC values, and consistent Q² metrics, supporting stable generalisation across data splits.

Model Robustness and Applicability Domain Assessment

We performed a series of validation analyses to examine the robustness, reliability, and practical usability of the selected MACCS-Random Forest model beyond simple performance metrics. These analyses aimed to verify that the observed predictive behaviour was stable across resampling, confined to a well-defined chemical space, and not the result of chance correlations. Model stability was examined using five-fold cross-validation on the training data (1,038 compounds, 62 MACCS features). The cross-validated performance remained consistent across folds, yielding a mean R² of 0.743 ± 0.063, a mean RMSE of 0.690 ± 0.068, and a mean MAE of 0.536 ± 0.043. Individual folds showed R² values ranging from 0.674 to 0.791, indicating that no single subset dominated the learning process (Tables S12-S13).
This consistency supports the internal stability of the MACCS fingerprint representation under Random Forest modelling and suggests that the model does not rely on a narrow subset of compounds.

The applicability domain of the final model was then assessed using leverage statistics and standardized residuals, visualised through a Williams plot. The leverage threshold (h*) was set at 0.15, calculated from the 62 descriptors and the 1,260 compounds in the training set (Fig. 1 and Table S14); this value is consistent with the common convention h* = 3(p + 1)/n = 3 × 63/1,260. Across all data splits, 1,410 compounds fell within the applicability domain, while 73 were flagged as outliers. A total of 37 compounds showed standardized residuals exceeding ±3, and an equal number exhibited high leverage values. These cases likely represent response outliers or structurally influential compounds rather than systematic modelling errors. The overall distribution indicates that most predictions fall within a chemically meaningful and statistically defined domain.

The robustness of the structure-activity relationship was further evaluated using Y-randomization tests. Fifty randomized models were generated by permuting the response variable while keeping the descriptor matrix unchanged (Fig. 1). These randomized models showed a marked loss of predictive power, with a mean R² of −0.246 ± 0.085, a mean RMSE of 1.607 ± 0.055, and a mean MAE of 1.364 ± 0.052. In contrast, the real MACCS-Random Forest model retained a test R² of 0.776, an RMSE of 0.682, and an MAE of 0.477 when trained on the combined training and validation data and evaluated on the independent test set (Table S15).

Chemical consistency among the shortlisted hits was examined using pairwise MACCS-based Tanimoto similarity. The resulting heatmap revealed moderate clustering alongside clear structural diversity, indicating that the high-activity compounds do not belong to a single dominant scaffold class (Fig. 1 and Table S16).
This analysis suggests that the model captures generalisable structure-activity relationships rather than relying on redundant chemotypes.

Feature Contribution and Model Interpretability

Feature-level interpretability was examined for the selected MACCS-Random Forest model using SHAP analysis. Global SHAP importance revealed a clear dominance of a small subset of MACCS fingerprints. MACCSFP64 (A$A!S), which encodes a ring-connected motif with a non-ring sulfur linkage, emerged as the most influential feature by a large margin, with a mean absolute SHAP value substantially higher than all other descriptors. This indicates that the structural motif encoded by MACCSFP64 plays a central role in driving activity predictions across the dataset. The next tier of influential features included MACCSFP19 (7-membered ring), MACCSFP131 (heteroatoms bearing hydrogen), MACCSFP86 (CH₂–hetero–CH₂ linker motif), and MACCSFP146 (oxygen atoms > 2), each contributing moderate but consistent effects (Fig. 2 and Tables S17–S18). The remaining top-ranked fingerprints showed smaller but consistent contributions, indicating that activity prediction depends on multiple supportive structural signals.

The SHAP beeswarm plot shows the direction of these effects. MACCSFP64 is the dominant driver: its presence consistently shifts predictions towards higher pIC₅₀ values, and its absence lowers predicted activity. This points to a direct relationship between sulfur-containing motifs in proximity to ring systems and increased predicted potency. A similar but weaker trend was observed for MACCSFP131 and MACCSFP86. These features generally increase predicted activity, which is consistent with the notion that polar heteroatom linkers help stabilise binding. In contrast, certain features such as MACCSFP125 (aromatic rings > 1) and MACCSFP19 showed bidirectional behaviour, indicating context-dependent effects rather than uniform promotion or suppression of activity.
Instead, their effect depends on the surrounding chemical environment.

Local SHAP explanations were examined for three representative compounds to illustrate model behaviour at the individual level. For the highly active compound CHEMBL4472528 (true pIC₅₀ = 9.00, predicted = 8.71), the prediction was primarily driven by strong positive contributions from MACCSFP64 (+0.78), MACCSFP131 (+0.29), and MACCSFP86 (+0.22) (Figs. 2–3). These results indicate that sulfur motifs near rings and heteroatom-rich linker regions act in concert within the compound scaffold. A few other features provided smaller positive contributions, and only one feature had a small negative effect. This predominance of positive contributions explains why the predicted activity was high and the residual error low.

The low-activity compound CHEMBL5820763 (true pIC₅₀ = 4.06, predicted = 4.45) showed the opposite pattern (Fig. 3). The absence of MACCSFP64 alone produced a large negative SHAP value of −0.93, which shifted the prediction into the low-activity range. Further negative contributions from MACCSFP146, MACCSFP86, and MACCSFP33 reinforced this trend. This profile indicates that the molecule lacks the heteroatom and ring motifs that the model associates with active compounds. Because only small positive effects counterbalanced these contributions, the model predicted low activity with a small residual error.

CHEMBL5855611 (true pIC₅₀ = 6.13, predicted = 7.65), by contrast, illustrates a case in which the model struggles. Here, MACCSFP64 made a strong positive contribution of +0.64, supported by moderate increases from MACCSFP125 and MACCSFP145. Although features such as MACCSFP19 and MACCSFP91 pulled the prediction downwards, they were not strong enough to correct the overestimation (Fig. 2).
This imbalance between strong positive signals and weak corrective features explains why the model was too optimistic. This behaviour is consistent with the compound's elevated residual and supports its classification as a difficult or borderline case rather than a random error.

Overall, the SHAP analysis indicates that the MACCS Random Forest model relies on chemically interpretable structural patterns and applies them consistently across the activity range. The dominance of a limited number of fingerprints explains the strong predictive performance, while the presence of competing positive and negative contributions in specific cases provides a transparent rationale for residual errors.

Prediction of Bioactivity for External Compounds

The final MACCS-Random Forest model was applied to an external compound set to evaluate its predictive utility beyond the curated ChEMBL dataset. Only molecules lying within the predefined applicability domain were retained, using a leverage threshold of h* = 0.15. From the screened library, 492 compounds satisfied this criterion and were considered suitable for reliable prediction (Tables S19-S20). Their predicted pIC₅₀ values clustered around moderate activity (mean = 5.40), with a relatively narrow spread, indicating stable model behaviour when applied to previously unseen compounds. From this in-domain pool, twenty compounds were prioritised based on higher predicted activity while maintaining low leverage values (Table S21). The predicted pIC₅₀ values of these top candidates were in the range of 5.99–6.45 (mean = 6.13) and are close to the median activity of the training set. Among these, CSSS00132718709 showed the highest predicted potency (pIC₅₀ = 6.45), followed by CSSS00161201427 (pIC₅₀ = 6.40), ZINC000022348882 (pIC₅₀ = 6.37), and CSSS06454646810 (pIC₅₀ = 6.12).
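
The leverage-based applicability-domain filter used in this screening can be sketched as follows. The sketch assumes the widely used warning threshold h* = 3(p + 1)/n, which reproduces the reported h* = 0.15 for p = 62 descriptors and n = 1,260 compounds, although the text does not state the formula explicitly. It also assumes leverages of the form h = xᵀ(XᵀX)⁻¹x without an intercept column. All function names and the toy matrix are hypothetical.

```python
def leverage_threshold(n_descriptors, n_train):
    """Warning leverage h* = 3(p + 1)/n (assumed convention)."""
    return 3.0 * (n_descriptors + 1) / n_train


def _invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; remove collinear descriptors first")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def leverages(x_train, x_query):
    """Leverage h = x^T (X^T X)^-1 x of each query row relative to the training matrix."""
    p = len(x_train[0])
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(p)] for i in range(p)]
    inv = _invert(xtx)
    result = []
    for x in x_query:
        inv_x = [sum(inv[i][j] * x[j] for j in range(p)) for i in range(p)]
        result.append(sum(x[i] * inv_x[i] for i in range(p)))
    return result


def in_domain(x_train, x_query, h_star):
    """True for query compounds whose leverage does not exceed h*."""
    return [h <= h_star for h in leverages(x_train, x_query)]


if __name__ == "__main__":
    print(leverage_threshold(62, 1260))     # threshold for the dataset sizes in the text
    toy_train = [[1, 0], [0, 1], [1, 1]]    # toy fingerprint bits, not real data
    print(leverages(toy_train, [[1, 0], [2, 2]]))
```

For real binary fingerprint matrices XᵀX is often ill-conditioned or singular, so a pseudo-inverse from a numerical library would normally replace the plain inversion used here for self-containment.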
All highlighted compounds exhibited leverage values well below the domain threshold, indicating that their predictions were not driven by structural extrapolation.

Molecular Docking Analysis

Molecular docking was carried out with the crystallographic structure of the target protein (PDB ID: 6P6G) [17] to assess the binding behaviour of the selected hit compounds with respect to the co-crystal ligand. The docking scores of CSSS06454646810, CHEMBL4472528, and the co-crystal ligand were −8.679, −9.993, and −10.580 kcal/mol, respectively, reflecting favourable predicted binding for all three ligands within the experimentally validated active site. The co-crystal ligand showed the most favourable docking score, as expected, while CHEMBL4472528 displayed a comparable score, suggesting a closely related interaction profile.

The co-crystal ligand reproduced the key crystallographic interactions reported for the 6P6G structure [17]. Two direct hydrogen bonds were observed, with Thr184 interacting with the oxygen atom of the formamide group and Cys186 interacting with the oxygen atom of the sulfonyl moiety. A protonated NH group formed a salt-bridge interaction with Asp241 at a distance of 3.96 Å, consistent with the acidic recognition motif described in the crystal structure (Fig. 4). Several water-mediated interactions further stabilised the complex, including bridging contacts involving Lys297 and the sulfonyl oxygen. Aromatic stabilisation was provided by a π-π stacking interaction between Phe183 and the isoxazole ring. Additional hydrophobic contacts were observed with residues lining the binding tunnel, including Tyr257, Tyr239, Met190, Leu240, Ile237, Cys238, Phe216, and Val368. This interaction pattern closely matches the crystallographic binding mode reported for the reference ligand (Figs. 4–5).

CHEMBL4472528 adopted a binding pose that strongly overlapped with the co-crystal ligand and preserved the major anchoring interactions within the active site.
Three direct hydrogen bonds were identified, involving Tyr239 and Thr184 with the formamide oxygen, and Tyr257 with the isoxazole oxygen. Water-mediated hydrogen bonds connected the ligand to Lys297 and Ser182, while the acidic residues Glu192, Asp241, and Glu294 formed a stabilising electrostatic environment around the protonated regions of the ligand. The binding pocket was further stabilised by extensive hydrophobic interactions with residues such as Cys333, Ile293, Leu290, Met190, Leu240, Cys238, Ile237, Phe216, and Val368 (Figs. 4–5). The preservation of Thr184 anchoring and proximity to the Glu192/Asp241 acidic region indicates that CHEMBL4472528 engages the same critical residues as the crystallographic ligand.

CSSS06454646810 also occupied the validated active site and maintained essential interactions, although with a reduced number of direct polar contacts. Two hydrogen bonds were observed, involving Thr184 with the NH group of the isoxazole ring and Gln252 with the NH group of the pyrimidine moiety. A π-π stacking interaction between Phe183 and the isoxazole ring contributed to aromatic stabilisation. A water-mediated interaction connected Asp241 with the formamide NH, and Glu192 provided a stabilising negative electrostatic environment. Hydrophobic interactions were observed with residues including Met190, Val368, Tyr257, Tyr239, Leu240, Cys238, Ile237, Cys180, Ile214, and Cys186 (Fig. 4). The cofactor SAH509 was positioned near the binding pocket within a 5 Å distance for this complex, suggesting potential proximity effects without directly interfering with ligand binding.

Thus, the docking results demonstrate that all three ligands engage the same core binding residues identified in the crystallographic structure of PDB 6P6G. The consistent involvement of Thr184, Asp241, Glu192, Phe183, and the surrounding hydrophobic tunnel confirms that the predicted binding modes align with experimentally validated interaction patterns [17].
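The overlap between the three docking interaction profiles can be checked with simple set operations. The sketch below uses only the residues named in the docking description above; the grouping into one set per ligand and the helper names (`residue_number`, `shared_by_at_least`) are illustrative choices, not part of the original workflow.

```python
# Residues named in the docking description, one set per ligand.
contacts = {
    "co-crystal ligand": {
        "Thr184", "Cys186", "Asp241", "Lys297", "Phe183", "Tyr257", "Tyr239",
        "Met190", "Leu240", "Ile237", "Cys238", "Phe216", "Val368",
    },
    "CHEMBL4472528": {
        "Tyr239", "Thr184", "Tyr257", "Lys297", "Ser182", "Glu192", "Asp241",
        "Glu294", "Cys333", "Ile293", "Leu290", "Met190", "Leu240", "Cys238",
        "Ile237", "Phe216", "Val368",
    },
    "CSSS06454646810": {
        "Thr184", "Gln252", "Phe183", "Asp241", "Glu192", "Met190", "Val368",
        "Tyr257", "Tyr239", "Leu240", "Cys238", "Ile237", "Cys180", "Ile214",
        "Cys186",
    },
}


def residue_number(residue: str) -> int:
    """Sort key: numeric part of a label such as 'Thr184'."""
    return int(residue[3:])


def shared_by_at_least(contact_sets: dict, minimum: int) -> list:
    """Residues listed for at least `minimum` ligands, ordered by residue number."""
    all_residues = set().union(*contact_sets.values())
    kept = [r for r in all_residues
            if sum(r in s for s in contact_sets.values()) >= minimum]
    return sorted(kept, key=residue_number)


shared_all = shared_by_at_least(contacts, 3)   # named for every ligand
shared_two = shared_by_at_least(contacts, 2)   # named for two or more ligands
```

With the residues as listed, nine are named for all three ligands (Thr184, Met190, Ile237, Cys238, Tyr239, Leu240, Asp241, Tyr257, Val368), while Phe183 and Glu192 are each named for two of the three. This reflects only what the prose lists; contacts visible in Figs. 4–5 but not mentioned in the text are not captured.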
Molecular Dynamics Simulation and Binding Energy Analysis

The reference co-crystal complex maintained a stable protein backbone over the 250 ns simulation. The protein heavy-atom RMSD remained centred at 2.65 ± 0.25 Å, with a narrow interquartile range (Q1-Q3: 2.575–2.792 Å), indicating limited global structural drift. The ligand RMSD showed moderate variability (2.59 ± 0.93 Å; median 2.97 Å). The trajectory suggests that the ligand did not leave its position abruptly but shifted gradually after the midpoint of the simulation. Despite this adjustment, the ligand remained in the binding pocket throughout the simulation. Ligand physicochemical descriptors were stable, including PSA values of 172.8 ± 4.6 Å² and MolSA values of 470.95 ± 5.92 Å², indicating no significant changes in solvent exposure or molecular conformation during the trajectory (Fig. 6).

CHEMBL4472528 showed stability comparable to the reference system and, at the ligand level, exceeded it. Protein RMSD remained in a steady range (2.84 ± 0.47 Å; median 3.03 Å), and the ligand exhibited lower and tighter fluctuations than the reference ligand (2.36 ± 0.65 Å; median 2.54 Å; maximum 3.66 Å). The trajectory reflects an initial accommodation phase followed by sustained retention of the ligand within the binding pocket. Both the PSA (172.6 ± 4.4 Å²) and radius of gyration (6.39 ± 0.36 Å) were stable, indicating that the molecule maintained its compact shape and a consistent polar orientation within the cavity (Fig. 6).

CSSS06454646810, by contrast, followed a markedly different dynamic path. Protein RMSD values were higher on average (3.27 ± 0.39 Å; median 3.34 Å), and ligand RMSD relative to the protein was substantially elevated (5.54 ± 1.34 Å; median 5.99 Å; maximum 7.25 Å). The ligand moved considerably early in the trajectory and went on to sample a much wider range of positions within the binding pocket (Fig. 6).
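The RMSD and RMSF figures quoted in this section (mean ± SD, median, quartiles, maximum) are summaries of per-frame or per-atom series. The following is a minimal sketch of how such values are typically derived, assuming the trajectory coordinates have already been aligned to the reference and exported as NumPy arrays. The function names and array shapes are hypothetical and are not taken from the simulation software used in the study; the use of the sample standard deviation is likewise an assumption.

```python
import numpy as np


def rmsd_per_frame(traj: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """RMSD of each frame against a reference.

    traj has shape (n_frames, n_atoms, 3) and ref has shape (n_atoms, 3).
    Frames are assumed to be pre-aligned; no fitting is done here.
    """
    sq_dist = ((traj - ref) ** 2).sum(axis=2)      # squared displacement per atom
    return np.sqrt(sq_dist.mean(axis=1))           # average over atoms, then root


def rmsf_per_atom(traj: np.ndarray) -> np.ndarray:
    """RMSF of each atom around its time-averaged position."""
    mean_pos = traj.mean(axis=0)
    sq_dev = ((traj - mean_pos) ** 2).sum(axis=2)  # shape (n_frames, n_atoms)
    return np.sqrt(sq_dev.mean(axis=0))            # average over frames, then root


def summarise(series: np.ndarray) -> dict:
    """Mean, sample SD, quartiles, and maximum of a 1-D series."""
    q1, median, q3 = np.percentile(series, [25, 50, 75])
    return {
        "mean": float(series.mean()),
        "sd": float(series.std(ddof=1)),           # assumption: sample SD
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(series.max()),
    }
```

A median noticeably above the mean, as reported for the ligand RMSD of the reference complex, is the kind of asymmetry such a summary makes visible: it points to a series that spends most of its time at the higher values after an early low-RMSD phase.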
The average RMSF (1.47 ± 0.62 Å) indicates that protein residue flexibility remained comparable to that of the other systems.

Throughout the simulation, the co-crystal ligand maintained a consistent network of polar, hydrophobic, and water-mediated interactions. A direct hydrogen bond persisted between the oxygen atom of the formamide group and the polar residue Thr184, in agreement with the crystallographic binding mode. A conserved water molecule mediated the interaction between Cys238 and the NH group of the isoxazole moiety, and a second water molecule mediated the interaction between Ile214 and an oxygen atom of the sulfonyl group. The NH-substituted benzene ring was part of a water-mediated hydrogen-bonding network involving the negatively charged residues Asp241 and Glu192, which contributes local electrostatic stabilisation. Aromatic and hydrophobic contacts were preserved, with Phe183 forming a π-π stacking interaction with the isoxazole ring and Tyr239 and Tyr257 providing non-bonded hydrophobic interactions. Secondary structure analysis confirmed preservation of the protein fold, with 39.09% α-helix, 11.37% β-strand, and a total SSE content of 50.46%.

CHEMBL4472528 exhibited a well-anchored binding mode supported by multiple direct and water-mediated interactions. Three direct hydrogen bonds were consistently observed, involving Thr184 with the formamide oxygen, Cys238 with the formamide NH group, and the positively charged residue Lys329 with the NH-substituted benzene ring. Lys329 additionally formed a stabilising π-cation interaction with the benzene ring. Water-mediated interactions further reinforced binding stability, including a water bridge linking Met242 to a sulfonyl oxygen and another water molecule connecting Asp332 to the NH-substituted benzene ring. A conserved water molecule also linked Cys238 to the NH group of the isoxazole moiety.
Hydrophobic contacts involving Phe183, Tyr239, and Tyr257 were maintained throughout the simulation. The protein secondary structure remained stable, with 39.91% α-helix, 11.19% β-strand, and a total SSE of 51.10%, closely matching the reference system.

In contrast, CSSS06454646810 displayed a more limited and less persistent interaction network. A π-π stacking interaction between the benzene ring and Phe183 was retained and represented the primary aromatic stabilisation. Water-mediated hydrogen bonding involved Ile214, linked through a water molecule to the formamide oxygen, and Thr184, which interacted with the water molecule hydrogen-bonded to the formamide NH group. Non-bonded hydrophobic contacts with Tyr239 and Tyr257 were present but fewer and less persistent than those observed for the co-crystal ligand and CHEMBL4472528. Secondary structure analysis showed a modest reduction in overall SSE content (37.18% α-helix, 11.23% β-strand, total SSE 48.41%), consistent with the increased ligand mobility observed in the RMSD analyses.

Principal component analysis (PCA) shows that the dominant protein motions are captured by the first few principal components across all complexes, with ligand-dependent differences in conformational sampling. Compared with the co-crystal ligand, the CHEMBL4472528 and CSSS06454646810 complexes display altered clustering patterns, indicating modulation of collective motions upon ligand binding (Fig. 7). Dynamic cross-correlation matrix (DCCM) analysis reveals broadly conserved correlation patterns, with localised changes in correlated and anti-correlated residue motions in the hit compound complexes, suggesting ligand-specific effects on protein dynamics without global destabilisation.

MM-GBSA calculations were carried out at the initial docking pose (0 ns) and after 250 ns of MD simulation to evaluate binding energetics over time (Table S22).
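Comparing MM-GBSA energies at two time points comes down to the shift ΔΔG = ΔG_bind(250 ns) − ΔG_bind(0 ns). The sketch below works through that arithmetic using the ΔG_bind values reported in this section for the three complexes. The dictionary layout and the function name `shift` are illustrative, and the energies are treated as relative trends rather than absolute affinities.

```python
# ΔG_bind (kcal/mol) at 0 ns and 250 ns, as quoted in the text.
dg_bind = {
    "CHEMBL4472528": (-103.57, -96.27),
    "co-crystal ligand": (-91.70, -51.75),
    "CSSS06454646810": (-35.07, -78.87),
}


def shift(dg_start: float, dg_end: float) -> float:
    """ΔΔG = ΔG(end) − ΔG(start); a negative value means binding became more favourable."""
    return round(dg_end - dg_start, 2)


# Change in binding energy over the simulation for each complex.
shifts = {name: shift(*values) for name, values in dg_bind.items()}

# Ligands ordered from most to least favourable ΔG_bind at 250 ns.
ranking_250ns = sorted(dg_bind, key=lambda name: dg_bind[name][1])
```

On these numbers the CHEMBL4472528 complex changes least (about +7.3 kcal/mol), the co-crystal ligand weakens by about 40 kcal/mol, and CSSS06454646810 improves by about 44 kcal/mol, with CHEMBL4472528 remaining the most favourable at 250 ns.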
CHEMBL4472528 showed the most favourable binding free energies among the studied ligands, with ΔG_bind values of −103.57 kcal/mol at 0 ns and −96.27 kcal/mol at 250 ns. Binding was dominated by strong van der Waals (−72.12 to −69.39 kcal/mol) and lipophilic contributions (−50.53 to −46.99 kcal/mol), indicating stable hydrophobic packing within the binding pocket. An increase in electrostatic contributions at 250 ns suggests adaptive optimisation of polar interactions during equilibration.

The co-crystal ligand exhibited strong initial binding (ΔG_bind = −91.70 kcal/mol at 0 ns), supported by favourable Coulombic and van der Waals interactions. By 250 ns, the binding free energy had weakened to −51.75 kcal/mol, mostly because of a decrease in van der Waals contributions and an increase in solvation penalties. This change in energy is consistent with the increased mobility of the ligand in the MD trajectory.

In contrast, CSSS06454646810 initially showed a weaker (less negative) binding energy of −35.07 kcal/mol. However, the binding energy improved markedly during the simulation, reaching −78.87 kcal/mol at 250 ns, probably because the molecule settled into the hydrophobic pocket, which enhanced the van der Waals and lipophilic contributions. Even so, its binding energy remained less favourable than that of CHEMBL4472528, consistent with its higher RMSD and its fewer, less persistent stabilising interactions.

Network and Pathway Context of SMYD3 Targeting

To place the structural results in a wider biological context, the interaction network and pathway associations of SMYD3 were analysed. Protein-protein interaction analysis revealed a compact network centred on chromatin-associated proteins, with direct connections to histone H3 variants such as H3C12 and H3C13.
The network also included functionally related chromatin regulators, including KMT2E, indicating potential coordination among histone methylation systems [25]. Molecular chaperones HSP90AA1 and HSP90AB1 were present, reflecting their known role in stabilising epigenetic enzymes and supporting conformational integrity [26]. Transcription-associated proteins such as ESR1 and HELZ were also connected, reinforcing the involvement of SMYD3 in transcriptional regulation rather than broad signalling pathways [11, 27] (Fig. 8 and Table S23).

Functional enrichment analysis supported these network observations. Gene Ontology biological process analysis highlighted chromatin organisation, nucleosome assembly, and epigenetic regulation of transcription as the most consistently represented functional categories within the network. Reactome and CTD pathway annotations showed concordant enrichment for chromatin-modifying enzymes and histone methylation processes, while pathways unrelated to nuclear or epigenetic regulation were weakly represented or absent.

Disease association analysis using curated CTD annotations indicated strong links between SMYD3 and cancer-related conditions, including liver, breast, and lung malignancies. These associations are consistent with reported roles of SMYD3 in oncogenic transcriptional programs and epigenetic dysregulation. The transcription factor (TF) subnetwork further shows that SMYD3 is linked to both activating and inhibitory transcriptional regulators, indicating its involvement in coordinating diverse gene regulatory programs rather than acting through a single pathway (Fig. 8). Within this framework, the structure-based prioritisation of CHEMBL4472528 provides a mechanistic context, as stable binding to SMYD3 would be expected to influence epigenetic regulation rather than peripheral signalling networks.

Thus, the network and pathway analysis places SMYD3 within an epigenetic and transcription-related functional context.
These findings align with the docking and molecular dynamics results, indicating that the observed ligand-protein interactions are consistent with the known biological role of SMYD3 and remain within the limits of computational interpretation.

Discussion

This study presents a coherent computational framework integrating QSAR modelling, structure-based docking, molecular dynamics simulations, and network-level contextualisation to prioritise small-molecule inhibitors of SMYD3. The approach was designed to balance predictive accuracy, chemical interpretability, and biological relevance.

Among the QSAR models evaluated, the MACCS fingerprint combined with Random Forest showed the most reliable external predictive performance. Its superiority over PubChem, CDK, and Substructure representations suggests that the activity landscape of SMYD3 inhibitors is better captured by small structural keys than by physicochemical descriptors. Consistent behaviour across multiple random splits, together with favourable agreement measures on an independent test set, suggests that the chosen model does not rest on chance correlations or artefacts of the training dataset. Applicability domain and Y-randomization analyses further support the interpretation that the predictive performance reflects genuine structure-activity relationships rather than over-fitting.

SHAP interpretability analysis provided insight into how the model converts chemical structures into bioactivity predictions. A few MACCS fingerprints, in particular those representing carbon-phosphorus connectivity, heterocyclic motifs, and aliphatic connections, dominated the feature importance. The direction and magnitude of these SHAP contributions were consistent for both active and inactive compounds, a good indication that the model is internally coherent.
Local explanations for specific molecules clarified why certain predictions were very accurate while others had larger residuals, making the model's decision-making process more transparent. This clarity helps to build confidence in the use of the model for hit prioritisation.

Using external screening within the applicability domain, a list of candidates with predicted activities close to the training median was then identified. Since there were no high-leverage predictions among the top hits, the prioritisation remained within known chemical space rather than relying on risky structural extrapolation. This conservative strategy follows best practice in virtual screening when experimental validation is unavailable.

We assessed the QSAR-derived hits by structure-based analysis using the SMYD3 crystal structure. Docking results showed that CHEMBL4472528 adopts a binding pose closely aligned with the co-crystal ligand, preserving key anchoring interactions involving Thr184, Asp241, Glu192, and Phe183. These residues form the core recognition environment of the SMYD3 active site and have been highlighted in prior structural studies of SET-domain methyltransferases [17]. The consistency between predicted binding modes and experimentally observed interaction patterns supports the chemical plausibility of the prioritised scaffold.

Molecular dynamics (MD) simulations were employed to further examine binding stability under explicit solvent conditions. Throughout the 250 ns trajectory, CHEMBL4472528 was stable, with low RMSD values and persistent polar, aromatic, and water-mediated interactions. Its interaction network was similar to that of the reference ligand, and MM-GBSA analysis revealed favourable binding energetics with van der Waals and lipophilic contributions as the main driving force. CSSS06454646810, on the other hand, showed higher flexibility and less persistent interactions, consistent with its weaker energetic profile.
These results justify a focus on CHEMBL4472528 for a detailed discussion of the binding mode, while acknowledging other possible chemotypes.

Network and pathway analysis indicated that SMYD3 operates within a tightly defined context of epigenetic and transcriptional regulation. Protein-protein interaction analysis revealed significant associations with histone H3 variants, chromatin-modifying enzymes, and molecular chaperones, in line with the known role of SMYD3 in chromatin organisation [28, 29]. Pathway enrichment using Gene Ontology, Reactome, and CTD annotations supported these results, with a strong emphasis on processes involved in chromatin modification and cancer-associated pathways [30, 31]. Notably, the network was restricted to these areas and did not overlap with unrelated signalling pathways. This specificity implies that the structural results relate to a focused regulatory context rather than to diverse, non-specific associations.

Thus, the convergence of QSAR performance, interpretable feature contributions, structure-based validation, dynamic stability, and network-level context supports the robustness of the proposed computational workflow. The results position CHEMBL4472528 as a structurally and biologically plausible SMYD3 inhibitor candidate, while demonstrating that integrative computational strategies can provide meaningful mechanistic insight even in the absence of experimental validation. This study, therefore, contributes a reproducible and interpretable framework for epigenetic target exploration using data-driven and structure-informed approaches.

Limitations and Future Perspectives

This study relies on computational analysis. The predicted inhibitory activity of the prioritised compounds has not been validated using experimental assays.
Although the QSAR models were carefully validated using external tests, applicability domain analysis, and Y-randomization, their reliability depends on the chemical space covered by the ChEMBL dataset. Docking and molecular dynamics simulations describe binding stability and interaction patterns but do not directly measure enzymatic inhibition or cellular response. MM-GBSA energies provide relative trends and should not be interpreted as absolute binding affinities.

Future work may focus on experimental testing of the top-ranked compounds, particularly CHEMBL4472528 and selected external hits, using SMYD3 inhibition assays. Integrating transcriptomic or epigenetic data following SMYD3 inhibition may also help clarify downstream biological effects and support translational relevance.

Conclusion

This study presents an integrated computational strategy to prioritise small-molecule inhibitors of SMYD3 by combining QSAR modelling with structure-based and network-level analyses. The MACCS fingerprint-based Random Forest model showed stable performance across validation tests and provided interpretable structure-activity relationships, supporting its use as a reliable screening and prioritisation tool rather than a purely statistical model.

Structure-based analysis confirmed that the prioritised compound binds within the experimentally characterised SMYD3 active site and maintains key residue interactions reported in crystallographic studies. Molecular dynamics simulations further supported stable ligand retention and favourable binding energetics over the simulation period, distinguishing CHEMBL4472528 from other candidates. These observations are consistent with the interaction patterns and energetic trends observed for the reference complex.

Network and pathway analysis placed SMYD3 within a focused epigenetic and transcription-related context, linking the structural findings to its known biological role.
This systems-level view supports the relevance of targeting SMYD3 without extending beyond the scope of computational interpretation. Thus, the results support CHEMBL4472528 as a structurally stable and biologically plausible SMYD3 inhibitor candidate and demonstrate the utility of combining validated QSAR modelling with molecular docking, molecular dynamics, and network analysis for rational inhibitor prioritisation.

Declarations

Conflicts of Interest: The authors declare no conflicts of interest.

Author Contribution

Abdullah R. Alzahrani: Conceptualization, methodology, investigation, supervision, and review of the manuscript; Zia Ur Rehman: Methodology, investigation, and manuscript drafting; Talha Jawaid: Investigation, formal analysis, and manuscript drafting; Abida Khan: Conceptualization, investigation, manuscript writing, review and editing, and corresponding author responsibilities. All authors have read and approved the final manuscript and agree to be accountable for the integrity and accuracy of the work.

Acknowledgement

The authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA for funding this research work through the project number “NBU-FFR-2026–2042–03”.

Data Availability Statement: The original contributions presented in this study are included in the article and its supplementary file.

References

1. Micallef I, Baron B (2025) Therapeutic Targeting of Protein Lysine and Arginine Methyltransferases: Principles and Strategies for Inhibitor Design. Int J Mol Sci 26:9038. https://doi.org/10.3390/IJMS26189038
2. Wang Z, Chen X, Zhu C et al (2025) Direct lysine dimethylation of IRF3 by the methyltransferase SMYD3 attenuates antiviral innate immunity. Proc Natl Acad Sci USA 122:e2320644122. https://doi.org/10.1073/PNAS.2320644122
3. Bernard BJ, Nigam N, Burkitt K, Saloura V (2021) SMYD3: a regulator of epigenetic and signaling pathways in cancer. Clin Epigenetics 13:45. https://doi.org/10.1186/S13148-021-01021-9
4. Nigam N, Bernard B, Sevilla S et al (2023) SMYD3 represses tumor-intrinsic interferon response in HPV-negative squamous cell carcinoma of the head and neck. Cell Rep 42:112823. https://doi.org/10.1016/J.CELREP.2023.112823
5. Mazur PK, Gozani O, Sage J, Reynoird N (2016) Novel insights into the oncogenic function of the SMYD3 lysine methyltransferase. Transl Cancer Res 5:330–333. https://doi.org/10.21037/TCR.2016.06.26
6. Zhao L, Wang Z, Cheng P et al (2025) SMYD3–CDCP1 Axis Drives EMT and CAF Activation in Colorectal Cancer and Is Targetable for Oxaliplatin Sensitization. Biomedicines 13:2737. https://doi.org/10.3390/BIOMEDICINES13112737
7. Liu Z, Zhao X, Zang M et al (2025) SMYD3 Promotes Immune Evasion in Clear Cell Renal Cell Carcinoma via SREBP1-Mediated Transactivation of CD47. Adv Sci (Weinheim, Baden-Wurttemberg, Ger) 12:e04200. https://doi.org/10.1002/ADVS.202404200
8. Yang Z, Liu F, Li Z et al (2023) Histone lysine methyltransferase SMYD3 promotes oral squamous cell carcinoma tumorigenesis via H3K4me3-mediated HMGA2 transcription. Clin Epigenetics 15:92. https://doi.org/10.1186/S13148-023-01506-9
9. Yu X, Zhao H, Wang R et al (2024) Cancer epigenetics: from laboratory studies and clinical trials to precision medicine. Cell Death Discov 10:28. https://doi.org/10.1038/s41420-024-01803-z
10. Ji Y, Chen Z, Cai J (2025) Roles and mechanisms of histone methylation in vascular aging and related diseases. Clin Epigenetics 17:35. https://doi.org/10.1186/S13148-025-01842-Y
11. Mo L, Deng M, Adhav R et al (2025) Oncogenic activation of SMYD3-SHCBP1 promotes breast cancer development and is coupled with resistance to immune therapy. Cell Death Dis 16:220. https://doi.org/10.1038/S41419-025-07570-8
12. Fasano C, Lepore Signorile M, De Marco K et al (2022) Identifying novel SMYD3 interactors on the trail of cancer hallmarks. Comput Struct Biotechnol J 20:1860–1875. https://doi.org/10.1016/J.CSBJ.2022.03.037
13. Yap CW (2011) PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32:1466–1474. https://doi.org/10.1002/JCC.21707
14. Qurat Ul Ain S, Islam Rather KU (2025) Integrated statistical modeling and machine learning techniques with SHAP for epidemiological data analysis. Ann Epidemiol 108:85–91. https://doi.org/10.1016/J.ANNEPIDEM.2025.06.012
15. Du Y, Liu X, Shah N et al (2022) ChemSpacE: Interpretable and Interactive Chemical Space Exploration. ChemRxiv. https://doi.org/10.26434/CHEMRXIV-2022-X49MH-V3
16. Tingle BI, Tang KG, Castanon M et al (2023) ZINC-22—A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. J Chem Inf Model 63:1166. https://doi.org/10.1021/ACS.JCIM.2C01253
17. Su DS, Qu J, Schulz M et al (2019) Discovery of Isoxazole Amides as Potent and Selective SMYD3 Inhibitors. ACS Med Chem Lett 11:133–140. https://doi.org/10.1021/ACSMEDCHEMLETT.9B00493
18. Madhavi Sastry G, Adzhigirey M, Day T et al (2013) Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J Comput Aided Mol Des 27:221–234. https://doi.org/10.1007/S10822-013-9644-8
19. Friesner RA, Banks JL, Murphy RB et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739–1749. https://doi.org/10.1021/JM0306430
20. Bowers KJ, Chow E, Xu H et al (2006) Scalable algorithms for molecular dynamics simulations on commodity clusters. Proc 2006 ACM/IEEE Conf Supercomput SC’06. https://doi.org/10.1145/1188455.1188544
21. Lu C, Wu C, Ghoreishi D et al (2021) OPLS4: Improving Force Field Accuracy on Challenging Regimes of Chemical Space. J Chem Theory Comput 17:4291–4300. https://doi.org/10.1021/ACS.JCTC.1C00302
22. Manaithiya A, Bhowmik R, Acharjee S et al (2024) Elucidating molecular mechanism and chemical space of chalcones through biological networks and machine learning approaches. Comput Struct Biotechnol J 23:2811–2836. https://doi.org/10.1016/J.CSBJ.2024.07.006
23. Szklarczyk D, Nastou K, Koutrouli M et al (2025) The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res 53:D730–D737. https://doi.org/10.1093/NAR/GKAE1113
24. Davis AP, Wiegers TC, Sciaky D et al (2025) Comparative Toxicogenomics Database’s 20th anniversary: update 2025. Nucleic Acids Res 53:D1328–D1334. https://doi.org/10.1093/NAR/GKAE883
25. Bochyńska A, Lüscher-Firzlaff J, Lüscher B (2018) Modes of Interaction of KMT2 Histone H3 Lysine 4 Methyltransferase/COMPASS Complexes with Chromatin. Cells 7:17. https://doi.org/10.3390/CELLS7030017
26. Abu-Farha M, Lanouette S, Elisma F et al (2011) Proteomic analyses of the SMYD family interactomes identify HSP90 as a novel target for SMYD2. J Mol Cell Biol 3:301–308. https://doi.org/10.1093/JMCB/MJR025
27. Hamamoto R, Furukawa Y, Morita M et al (2004) SMYD3 encodes a histone methyltransferase involved in the proliferation of cancer cells. Nat Cell Biol 6:731–740. https://doi.org/10.1038/ncb1151
28. Silva FP, Hamamoto R, Kunizaki M et al (2007) Enhanced methyltransferase activity of SMYD3 by the cleavage of its N-terminal region in human cancer cells. Oncogene 27:2686–2692. https://doi.org/10.1038/sj.onc.1210929
29. Ding Q, Cai J, Jin L et al (2024) A novel small molecule ZYZ384 targeting SMYD3 for hepatocellular carcinoma via reducing H3K4 trimethylation of the Rac1 promoter. MedComm 5:e711. https://doi.org/10.1002/MCO2.711
30. Foreman KW, Brown M, Park F et al (2011) Structural and Functional Profiling of the Human Histone Methyltransferase SMYD3. PLoS ONE 6:e22290. https://doi.org/10.1371/JOURNAL.PONE.0022290
31. Fu W, Liu N, Qiao Q et al (2016) Structural basis for substrate preference of SMYD3, a SET domain-containing protein lysine methyltransferase. J Biol Chem 291:9173–9180. https://doi.org/10.1074/jbc.M115.709832

Additional Declarations

No competing interests reported.
Supplementary Files

SupplementaryfileSMYD3.xlsx

Published version (Molecular Diversity): https://doi.org/10.1007/s11030-026-11533-2

Figure 1. Validation and chemical space assessment of the MACCS-RF model using SHAP-based heatmaps, Tanimoto similarity, Williams plot, leverage analysis, and Y-randomization.

Figure 2. Local and global SHAP analysis of the MACCS-RF model. Waterfall plots show feature contributions for representative high-activity (CHEMBL4472528), low-activity (CHEMBL5820763), and problematic (CHEMBL5855611) compounds, while beeswarm and mean SHAP plots highlight the most influential MACCS fingerprints driving activity predictions.

Figure 3. Chemical structures of hit compounds and SAR interpretation based on MACCS fingerprint features identified by SHAP analysis.

Figure 4. 3D binding modes and interaction profiles of the hit compounds CHEMBL4472528 and CSSS06454646810 compared with the co-crystal ligand within the target binding pocket. Surface views illustrate the accommodation of the hit molecule within the active-site cavity.

Figure 5. 2D binding modes and interaction profiles of the hit compounds CHEMBL4472528 and CSSS06454646810 compared with the co-crystal ligand within the target binding pocket.

Figure 6. Molecular dynamics stability and flexibility analysis of the target protein in complex with CHEMBL4472528, CSSS06454646810, and the co-crystal ligand over 250 ns. Time-dependent protein and ligand RMSD profiles assess complex stability, while residue-wise RMSF and backbone sidechain RMSF comparisons highlight local flexibility patterns across the protein.

Figure 7. Principal component analysis (PCA) and dynamic cross-correlation matrix (DCCM) derived from 250 ns molecular dynamics simulations. (A-B) Co-crystal ligand complex, (C-D) CHEMBL4472528 complex, and (E-F) CSSS06454646810 complex. PCA plots illustrate dominant collective motions of the protein, while DCCM maps depict residue-residue correlated and anti-correlated motions across each complex.

Figure 8. Protein-protein interaction network and transcription factor (TF) regulatory subnetwork centered on SMYD3.
16:10:04","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":6803460,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-8662415/v1/06b5ff36-139d-49e2-b40f-3a4b047a7653.pdf"},{"id":101297983,"identity":"75916677-7333-41ea-a8ef-32f184612c44","added_by":"auto","created_at":"2026-01-28 09:29:32","extension":"xlsx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":6600746,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryfileSMYD3.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-8662415/v1/fd1b6fc6bd92028e596af053.xlsx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Predictive Bioactivity Modeling and Structural Binding Analysis for the Identification of Potential SMYD3 Modulators","fulltext":[{"header":"Introduction","content":"\u003cp\u003eEpigenetic regulation plays a central role in shaping transcriptional programs that govern cellular identity, metabolism, and disease progression. Among epigenetic writers, protein lysine methyltransferases influence chromatin organization through site-specific histone methylation, thereby modulating transcriptional output without altering DNA sequence [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Dysregulation of these enzymes has been widely implicated in cancer, where altered histone methylation supports oncogenic transcriptional states and adaptive cellular phenotypes [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e, \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. 
\u003cem\u003eSMYD3\u003c/em\u003e (SET and MYND domain-containing protein 3) has emerged as a context-dependent histone methyltransferase that primarily regulates transcriptional activity rather than acting as a global chromatin modifier [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e]. Accumulating evidence indicates that \u003cem\u003eSMYD3\u003c/em\u003e contributes to disease-associated transcriptional amplification by modulating oncogenic signaling pathways, metabolic reprogramming, and stress-responsive gene expression programs. Unlike broadly acting epigenetic enzymes, \u003cem\u003eSMYD3\u003c/em\u003e appears to exert its effects in a pathway-selective manner, which makes it an attractive but still incompletely explored target for chemical modulation [\u003cspan additionalcitationids=\"CR6\" citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eStructural and biochemical studies have revealed that \u003cem\u003eSMYD3\u003c/em\u003e has a well-defined catalytic pocket that can accommodate small molecules that interfere with substrate recognition and methyltransferase activity [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Several \u003cem\u003eSMYD3\u003c/em\u003e inhibitors have been identified via experimental screening and structure-guided medicinal chemistry efforts, and curated inhibitory bioactivity data are now available to the public [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e, \u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. However, these data have yet to be exploited systematically using ligand-based quantitative analysis of activity to identify chemical features that control inhibitory potency in diverse scaffolds. 
In parallel, \u003cem\u003eSMYD3\u003c/em\u003e has been positioned within a compact epigenetic and transcriptional regulatory network involving histone H3 variants, chromatin-associated methyltransferases, and transcription-related cofactors [\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e, \u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e, \u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. Pathway and disease association analyses consistently link \u003cem\u003eSMYD3\u003c/em\u003e to cancer-related phenotypes, reinforcing its relevance in oncogenic transcriptional control [\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e]. Despite this growing biological understanding, integrative studies connecting chemical determinants of \u003cem\u003eSMYD3\u003c/em\u003e inhibition with systems-level biological context remain limited. Chemical modulation of \u003cem\u003eSMYD3\u003c/em\u003e is therefore expected to interfere with substrate recognition and methyltransferase activity, providing a rational basis for targeting its transcriptional regulatory function. This gap constrains the rational prioritisation of candidate modulators and hinders the interpretation of how chemical inhibition may translate into biologically relevant effects.\u003c/p\u003e \u003cp\u003eThe present study addresses this gap by applying a quantitative activity analysis framework to characterise and prioritise small-molecule \u003cem\u003eSMYD3\u003c/em\u003e modulators using curated inhibitory bioactivity data. Predictive models were developed to identify key molecular features associated with \u003cem\u003eSMYD3\u003c/em\u003e inhibition and were applied to external compounds within a defined applicability domain. 
Structure-guided analysis and molecular dynamics simulations based on an experimentally validated \u003cem\u003eSMYD3\u003c/em\u003e structure were then used to assess binding stability and mechanistic plausibility. Finally, network and pathway analyses were employed to place \u003cem\u003eSMYD3\u003c/em\u003e targeting within its biological context. This integrated approach provides a mechanistically grounded perspective on potential SMYD3 modulators while remaining within the limits of computational inference.\u003c/p\u003e"},{"header":"Methods and Materials","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData collection and preprocessing\u003c/h2\u003e \u003cp\u003e\u003cem\u003eSMYD3\u003c/em\u003e bioactivity data were retrieved from the ChEMBL database (target ID: CHEMBL2321643; UniProt accession: Q9H7B4). Only small-molecule inhibitors with experimentally reported IC₅₀ (nanomolar) values were considered. Starting from 1,873 records, we converted all activity values to pIC\u003csub\u003e50\u003c/sub\u003e to facilitate regression modelling (Table \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e). During curation, we removed duplicates, inconsistent measurements, and any records with missing or unclear activity data. This yielded a final set of 1,483 unique compounds for QSAR analysis (Table S2). For descriptive purposes, compounds were grouped into activity classes based on commonly used pIC₅₀ thresholds. 
Molecules with pIC₅₀ values of 7.0 or higher were classified as active, those with pIC₅₀ values below 6.0 were labelled inactive, and compounds falling between these thresholds were considered intermediate (Table S3).\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eMolecular representation and descriptor preprocessing\u003c/h3\u003e\n\u003cp\u003eMolecular descriptors were generated to represent chemical structures in a machine-learning compatible format. Fingerprints and descriptor matrices were computed using a PaDEL-based workflow [\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e], including four descriptor spaces: MACCS keys, PubChem fingerprints, CDK descriptors, and Substructure fingerprints. Each descriptor set was produced using identical compound ordering to ensure that the same molecules were consistently aligned across all representations. To improve model stability and reduce noise from non-informative variables, a low-variance feature filtering step was applied independently to each descriptor set. After variance-based removal, the final feature counts were: 62 features for MACCS, 120 features for PubChem, 12 features for Substructure, and 504 features for CDK. These reduced descriptor matrices were used in downstream QSAR modelling and comparative evaluation.\u003c/p\u003e\n\u003ch3\u003eQSAR model development and validation\u003c/h3\u003e\n\u003cp\u003eQSAR models were developed using a supervised regression framework to predict pIC₅₀ values based on multiple molecular descriptor sets, including MACCS, PubChem, CDK, and Substructure fingerprints. The curated dataset was divided using a two-step random splitting strategy, with 70% of compounds assigned to the training set and the remaining 30% further split equally into validation (15%) and independent test (15%) sets. All splits were performed with random shuffling and fixed seeds to ensure reproducibility. 
Several machine learning algorithms were evaluated across descriptor spaces, including Random Forest, Extra Trees, Gradient Boosting Regressor, Histogram-based Gradient Boosting Regressor, Ridge Regression with scaling, and Support Vector Regression. Tree-based methods were primarily employed to capture non-linear structure-activity relationships, while linear and kernel-based models were included for comparison. Descriptor matrices were converted to numeric format, and missing values were handled appropriately before model fitting.\u003c/p\u003e \u003cp\u003eModel performance was evaluated on the training, validation, and test sets using standard regression metrics. To assess how well the models generalise to new data, external predictive power was measured using Q\u0026sup2;\u003csub\u003eext\u003c/sub\u003e and the Tropsha criteria (Q\u0026sup2;\u003csub\u003eF1\u003c/sub\u003e, Q\u0026sup2;\u003csub\u003eF2\u003c/sub\u003e, and Q\u0026sup2;\u003csub\u003eF3\u003c/sub\u003e). Predictions from each descriptor-model combination were retained together with complete compound metadata, so that applicability domain analysis and virtual screening could proceed directly with the necessary context.\u003c/p\u003e\n\u003ch3\u003eModel selection and robustness analysis\u003c/h3\u003e\n\u003cp\u003eWe systematically compared QSAR models to identify the best descriptor-algorithm combinations. Model selection prioritised test-set performance, with models ranked by test RMSE and CCC. This approach favours a final model with low prediction error and strong agreement between observed and predicted activities. In addition to overall ranking, the best-performing model within each descriptor space was identified to enable descriptor-wise comparison. 
Model robustness was assessed using five-fold cross-validation on the training data, with performance reported as mean and standard deviation values to evaluate stability across different data partitions. The applicability domain of the selected models was defined using a leverage-based approach and visualised through Williams plots, allowing identification of in-domain predictions, high-leverage compounds, and response outliers. To confirm that model performance was not driven by chance correlations, Y-randomization tests were performed by permuting the response variable and rebuilding models under identical conditions. Model interpretability was examined using SHAP analysis to quantify the contribution of individual descriptors to predicted pIC₅₀ values [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. For downstream virtual screening, predicted compounds were filtered based on applicability domain criteria, and chemical diversity among shortlisted hits was assessed using Tanimoto similarity analysis.\u003c/p\u003e\n\u003ch3\u003ePrediction of bioactivity for external compounds\u003c/h3\u003e\n\u003cp\u003eTo expand the chemical space around the identified QSAR hits, additional candidate molecules were retrieved using scaffold- and similarity-based searches. After removing invalid structures, these molecules were retained as an external screening set. A total of 1,332 structurally related compounds were obtained from Chemspace [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e], along with 29 compounds obtained from the ZINC database [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] (Table S4). All external compounds were encoded using MACCS fingerprints generated with the same protocol applied during QSAR model development. The final Random Forest QSAR model, trained on the combined training and validation data, was then used to predict pIC₅₀ values for the external library. 
Feature alignment was maintained by retaining only those MACCS descriptors present in the trained model. The reliability of predictions was assessed using a leverage-based applicability domain approach. Leverage values were calculated for each external compound, and only predictions falling within the defined applicability domain were considered reliable. In-domain compounds with high predicted activity were prioritised and shortlisted for subsequent structure-based analyses.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eStructure-Based Molecular Docking and Molecular Dynamics Simulations\u003c/h2\u003e \u003cp\u003eThe crystal structure of human \u003cem\u003eSMYD3\u003c/em\u003e was obtained from the Protein Data Bank (PDB ID: 6P6G) [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e], a high-resolution co-crystal complex containing an isoxazole amide inhibitor bound to the catalytic site. This structure was selected based on its experimental quality, lack of mutations, and well-defined binding pocket. Protein preparation was performed using the Maestro Protein Preparation Wizard [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e], including correction of bond orders, assignment of protonation states, addition of missing hydrogens, and restrained energy minimization. Ligand structures, including the QSAR-derived hit, the co-crystallized reference inhibitor, and the selected externally predicted compound, were prepared using LigPrep. Relevant ionization states at physiological pH were generated, and geometries were optimized before docking. Molecular docking was carried out using Glide in extra-precision mode [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], with the receptor grid centered on the co-crystallized ligand. 
Docked poses were ranked using Glide XP scores and interaction patterns, and the most favorable binding conformations were selected for further analysis.\u003c/p\u003e \u003cp\u003eMolecular dynamics simulations were performed using the Desmond package [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. Protein\u0026ndash;ligand complexes were parameterized with the OPLS4 force field [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] and solvated with explicit TIP3P water in an orthorhombic box with a 10 \u0026Aring; buffer. Systems were neutralized and adjusted to 0.15 M NaCl. Following energy minimization, equilibration was carried out under NVT and NPT ensembles using standard Desmond protocols. Temperature (300 K) and pressure (1 atm) were maintained. Production simulations were conducted for 250 ns [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. Structural stability was assessed using protein and ligand RMSD, while residue flexibility was evaluated using RMSF. Protein\u0026ndash;ligand interactions were monitored throughout the trajectory. Relative binding free energies were estimated using the MM-GBSA approach on representative frames. Essential dynamic behavior was explored using principal component analysis, and correlated residue motions were examined through dynamic cross-correlation matrix analysis.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eNetwork and Pathway context analysis\u003c/h3\u003e\n\u003cp\u003eA protein-protein interaction network centred on \u003cem\u003eSMYD3\u003c/em\u003e was constructed using the STRING database [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e], with the organism restricted to Homo sapiens. Interactions were limited to first-shell partners and filtered using a confidence score to retain experimentally supported and curated associations. 
This restriction avoided excessive network expansion and maintained biological interpretability. To complement the interaction network, gene-disease and gene-pathway associations related to \u003cem\u003eSMYD3\u003c/em\u003e were retrieved from the Comparative Toxicogenomics Database (CTD) [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. Functional enrichment analysis was performed using Gene Ontology biological processes and pathway annotations provided by STRING. Enrichment results were screened to retain statistically significant terms related to chromatin regulation, transcriptional control, and epigenetic processes.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e \u003ch2\u003eQSAR Model Development and Performance Evaluation\u003c/h2\u003e \u003cp\u003eQSAR models were developed using four descriptor representations. These were MACCS fingerprints, PubChem fingerprints, CDK descriptors, and Substructure fingerprints. Low-variance filtering was applied to reduce non-informative features. This resulted in 66 MACCS features, 124 PubChem features, 508 CDK features, and 16 Substructure features (Tables S5-S8). The dataset was split into training, validation, and test subsets using a 70:15:15 strategy. Model development and selection were guided by performance on the independent test set. Six regression algorithms were screened across the descriptor spaces. These were Random Forest, Extra Trees, Gradient Boosting, HistGradientBoosting, Ridge regression, and Support Vector Regression. Hyperparameter tuning was performed for the candidate models to improve generalisation. The final selection was based on external predictive performance. Test RMSE was used as the primary ranking metric. Test CCC was used to confirm agreement between predicted and experimental pIC₅₀ values. 
Test R\u0026sup2; and MAE were reported as supporting indicators (Table S9).\u003c/p\u003e \u003cp\u003eThe MACCS fingerprint-based Random Forest model yielded the best overall predictive performance among the evaluated descriptor algorithm combinations. It achieved a test R\u0026sup2; of 0.891 with a test RMSE of 0.447 and a test MAE of 0.257. The corresponding test CCC reached 0.941. This model also showed strong alignment between predicted and observed values, as reflected by a test Pearson correlation of 0.945 (Tables\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). These results supported the selection of MACCS Random Forest as the primary model for downstream analyses (Tables S10-S11). The best PubChem model was obtained using HistGradientBoosting. It reached a test R\u0026sup2; of 0.811 with a test RMSE of 0.588 and a test MAE of 0.426. The test CCC was 0.895, and the test Pearson correlation was 0.901. This performance indicated reasonable predictive capacity. It remained consistently weaker than the MACCS-based model when ranked by RMSE and CCC (Table S10).\u003c/p\u003e \u003cp\u003eCDK descriptors produced lower generalisation performance in the current setting. The best CDK model was HistGradientBoosting with a test R\u0026sup2; of 0.503 and a test RMSE of 0.971. The test MAE was 0.726. The test CCC was 0.669, and the Pearson correlation was 0.709. Substructure fingerprints showed limited predictive strength. HistGradientBoosting was also the best Substructure model, with a test R\u0026sup2; of 0.575 and a test RMSE of 0.883. The test MAE was 0.671. The test CCC was 0.734, and the Pearson correlation was 0.761 (Tables\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). 
These results suggested that the MACCS fingerprint space captured the most informative structural patterns for activity prediction in this dataset (Table S10). Model stability was assessed using multiple random splits. Random seeds ranging from 0 to 9 were evaluated to reduce the likelihood of chance performance driven by a single split. The MACCS models showed consistent behaviour across these splits and across algorithms. The best-performing configuration was obtained under one split, which was then used as the reference setting for interpretability, applicability domain analysis, external prediction, and structure-based validation. Comprehensive results for all descriptor algorithm combinations and all random splits are provided in the Supplementary Information (Table S10). This includes training and validation metrics, external validation statistics, and additional Q\u0026sup2; measures.\u003c/p\u003e \u003cp\u003eThis selection strategy avoided reliance on a single algorithm or a single descriptor representation. It prioritised models that achieved low test error, strong agreement metrics, and stable performance across repeated splits. 
These considerations supported the use of the MACCS fingerprint-based Random Forest model for subsequent applicability domain analysis, SHAP-based interpretation, external compound screening, and molecular docking and molecular dynamics simulations.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eTraining and Validation Performance of QSAR Models\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDescriptor\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMACCS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePubChem\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSubstructure\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCDK\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFeatures\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e66\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e124\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e508\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrain R\u0026sup2;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.968\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.873\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.624\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.837\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrain RMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.248\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.491\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.845\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.559\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTrain MAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.144\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" 
colname=\"c3\"\u003e \u003cp\u003e0.343\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.638\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.409\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVal R\u0026sup2;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.815\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.788\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.416\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.334\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVal RMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.625\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.669\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.111\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1.113\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVal MAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.361\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.491\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.872\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.841\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e 
\u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eExternal Test Set Performance and Model Agreement Metrics.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDescriptor\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMACCS\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePubChem\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSubstructure\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCDK\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eAlgorithm\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRandom Forest\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eHistGradientBoosting\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest R\u0026sup2;\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.891\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.811\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.575\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.503\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest RMSE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.447\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.588\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.883\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.971\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest MAE\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.257\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.426\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.671\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.726\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eTest Pearson r\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.945\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.901\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.761\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.709\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e 
\u003cp\u003eTest CCC\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.941\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0.895\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.734\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0.669\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNote\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colspan=\"4\" nameend=\"c5\" namest=\"c2\"\u003e \u003cp\u003eModel selection prioritised external predictivity. The MACCS-RF model shows comparable validation and test performance, high CCC values, and consistent Q\u0026sup2; metrics, supporting stable generalisation across data splits.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eModel Robustness and Applicability Domain Assessment\u003c/h2\u003e \u003cp\u003eWe performed a series of validation analyses to examine the robustness, reliability, and practical usability of the selected MACCS-Random Forest model beyond simple performance metrics. These analyses aimed to verify that the observed predictive behaviour was stable across resampling, confined to a well-defined chemical space, and not the result of chance correlations. Model stability was examined using five-fold cross-validation on the training data (1038 compounds, 62 MACCS features). The cross-validated performance remained consistent across folds, yielding a mean R\u0026sup2; of 0.743\u0026thinsp;\u0026plusmn;\u0026thinsp;0.063, a mean RMSE of 0.690\u0026thinsp;\u0026plusmn;\u0026thinsp;0.068, and a mean MAE of 0.536\u0026thinsp;\u0026plusmn;\u0026thinsp;0.043. 
Individual folds showed R\u0026sup2; values ranging from 0.674 to 0.791, indicating that no single subset dominated the learning process (Tables S12-S13). This consistency supports the internal stability of the MACCS fingerprint representation under Random Forest modelling and suggests that the model does not rely on a narrow subset of compounds.\u003c/p\u003e \u003cp\u003eThe applicability domain of the final model was then assessed using leverage statistics and standardized residuals, visualised through a Williams plot. The leverage threshold (h*) was set at 0.15, calculated from the 62 descriptors and the 1,260 compounds in the training set; this value is consistent with the conventional warning leverage h* = 3(p + 1)/n for p = 62 and n = 1,260 (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Table S14). Across all data splits, 1,410 compounds fell within the applicability domain, while 73 were flagged as outliers. A total of 37 compounds showed standardized residuals exceeding\u0026thinsp;\u0026plusmn;\u0026thinsp;3, and an equal number exhibited high leverage values. These cases likely represent response outliers or structurally influential compounds rather than systematic modelling errors. The overall distribution indicates that most predictions fall within a chemically meaningful and statistically defined domain.\u003c/p\u003e \u003cp\u003eThe robustness of the structure-activity relationship was further evaluated using Y-randomization tests. Fifty randomized models were generated by permuting the response variable while keeping the descriptor matrix unchanged (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). These randomized models showed a marked loss of predictive power, with a mean R\u0026sup2; of \u0026minus;\u0026thinsp;0.246\u0026thinsp;\u0026plusmn;\u0026thinsp;0.085, a mean RMSE of 1.607\u0026thinsp;\u0026plusmn;\u0026thinsp;0.055, and a mean MAE of 1.364\u0026thinsp;\u0026plusmn;\u0026thinsp;0.052. 
In contrast, the real MACCS-Random Forest model retained a test R\u0026sup2; of 0.776, an RMSE of 0.682, and an MAE of 0.477 when trained on the combined training and validation data and evaluated on the independent test set (Table S15). Chemical consistency among the shortlisted hits was examined using pairwise MACCS-based Tanimoto similarity. The resulting heatmap revealed moderate clustering alongside clear structural diversity, indicating that the high-activity compounds do not belong to a single dominant scaffold class (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Table S16). This analysis suggests that the model captures generalisable structure-activity relationships rather than relying on redundant chemotypes.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eFeature Contribution and Model Interpretability\u003c/h2\u003e \u003cp\u003eFeature-level interpretability was examined for the selected MACCS-Random Forest model using SHAP analysis. Global SHAP importance revealed a clear dominance of a small subset of MACCS fingerprints. MACCSFP64 (A\u003cspan\u003e$\u003c/span\u003eA!S), which encodes a ring-connected motif with a non-ring sulfur linkage, emerged as the most influential feature by a large margin, with a mean absolute SHAP value substantially higher than all other descriptors. This indicates that the structural motif encoded by MACCSFP64 plays a central role in driving activity predictions across the dataset. The next tier of influential features included MACCSFP19 (7-membered ring), MACCSFP131 (hetero atoms bearing hydrogen), MACCSFP86 (CH₂\u0026ndash;hetero\u0026ndash;CH₂ linker motif), and MACCSFP146 (oxygen atoms\u0026thinsp;\u0026gt;\u0026thinsp;2), each contributing moderate but consistent effects (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e and Tables S17\u0026ndash;S18). 
The remaining top-ranked fingerprints showed smaller but consistent contributions, indicating that activity prediction depends on multiple supportive structural signals.\u003c/p\u003e \u003cp\u003eThe SHAP beeswarm plot shows the direction and magnitude of each feature's effect on the model output. MACCSFP64 is the strongest driver: its presence consistently shifts predictions towards higher pIC\u003csub\u003e50\u003c/sub\u003e values, and its absence lowers predicted activity. This points to an association in the model between sulfur-containing motifs adjacent to ring systems and increased predicted potency. A similar but weaker trend was observed for MACCSFP131 and MACCSFP86. These features generally increase predicted activity, which is consistent with the notion that polar heteroatom linkers contribute to molecular stabilisation. In contrast, features such as MACCSFP125 (aromatic rings\u0026thinsp;\u0026gt;\u0026thinsp;1) and MACCSFP19 showed bidirectional behaviour, indicating context-dependent effects rather than uniform promotion or suppression of activity; their contribution depends on the surrounding chemical environment.\u003c/p\u003e \u003cp\u003eLocal SHAP explanations were examined for three representative compounds to illustrate model behaviour at the individual level. For the highly active compound CHEMBL4472528 (true pIC₅₀ = 9.00, predicted\u0026thinsp;=\u0026thinsp;8.71), the prediction was primarily driven by strong positive contributions from MACCSFP64 (+\u0026thinsp;0.78), MACCSFP131 (+\u0026thinsp;0.29), and MACCSFP86 (+\u0026thinsp;0.22) (Figs.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). These contributions indicate that ring-adjacent sulfur motifs and heteroatom-rich linker regions act in concert within the compound scaffold. 
A few other features provided smaller positive contributions, and only one feature had a small negative effect. This predominance of positive contributions explains both the high predicted activity and the low residual error.\u003c/p\u003e \u003cp\u003eThe low-activity compound CHEMBL5820763 (true pIC₅₀ = 4.06, predicted\u0026thinsp;=\u0026thinsp;4.45) showed the opposite pattern (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). The absence of MACCSFP64 alone produced a large negative SHAP value of \u0026minus;\u0026thinsp;0.93, which shifted the prediction into the low-activity range. Additional negative contributions from MACCSFP146, MACCSFP86, and MACCSFP33 reinforced this trend. This profile indicates that the molecule lacks the heteroatom and ring motifs that the model associates with active compounds. Because only small positive effects counterbalanced these contributions, the model predicted low activity with a small residual error. By contrast, CHEMBL5855611 (true pIC\u003csub\u003e50\u003c/sub\u003e\u0026thinsp;=\u0026thinsp;6.13, predicted\u0026thinsp;=\u0026thinsp;7.65) illustrates a case in which the model struggles. Here, MACCSFP64 made a strong positive contribution of +\u0026thinsp;0.64, supported by moderate positive contributions from MACCSFP125 and MACCSFP145. Although features such as MACCSFP19 and MACCSFP91 pulled the prediction downwards, they were not strong enough to correct the overestimation (Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). This imbalance between strong positive signals and weak corrective features explains why the model overpredicted activity. This behaviour is consistent with the compound\u0026rsquo;s elevated residual and supports its classification as a difficult or borderline case rather than a random error. 
Overall, the SHAP analysis indicates that the MACCS-Random Forest model relies on chemically interpretable structural patterns and applies them consistently across the activity range. The dominance of a limited number of fingerprints explains the strong predictive performance, while the presence of competing positive and negative contributions in specific cases provides a transparent rationale for residual errors.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003ePrediction of Bioactivity for External Compounds\u003c/h2\u003e \u003cp\u003eThe final MACCS-Random Forest model was applied to an external compound set to evaluate its predictive utility beyond the curated ChEMBL dataset. Only molecules lying within the predefined applicability domain were retained, using a leverage threshold of h* = 0.15. From the screened library, 492 compounds satisfied this criterion and were considered suitable for reliable prediction (Tables S19-S20). Their predicted pIC₅₀ values clustered around moderate activity (mean\u0026thinsp;=\u0026thinsp;5.40), with a relatively narrow spread, indicating stable model behaviour when applied to previously unseen compounds within the applicability domain.\u003c/p\u003e \u003cp\u003eFrom this in-domain pool, twenty compounds were prioritised based on higher predicted activity while maintaining low leverage values (Table S21). The predicted pIC\u003csub\u003e50\u003c/sub\u003e values of these top candidates ranged from 5.99 to 6.45 (mean\u0026thinsp;=\u0026thinsp;6.13) and are close to the median activity of the training set. Among these, CSSS00132718709 showed the highest predicted potency (pIC₅₀ = 6.45), followed by CSSS00161201427 (pIC₅₀ = 6.40), ZINC000022348882 (pIC₅₀ = 6.37) and CSSS06454646810 (pIC₅₀ = 6.12). 
All highlighted compounds exhibited leverage values well below the domain threshold, indicating that their predictions were not driven by structural extrapolation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eMolecular Docking Analysis\u003c/h2\u003e \u003cp\u003eMolecular docking was carried out with the crystallographic structure of the target protein (PDB ID: 6P6G) [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e] to assess the binding behaviour of the selected hit compounds with respect to the co-crystal ligand. The docking scores of CSSS06454646810, CHEMBL4472528, and the co-crystal ligand were \u0026minus;\u0026thinsp;8.679, \u0026minus;\u0026thinsp;9.993, and \u0026minus;\u0026thinsp;10.580 kcal/mol, respectively, reflecting favourable predicted binding for all three ligands within the experimentally validated active site. The co-crystal ligand had the most favourable docking score, as expected, while CHEMBL4472528 displayed a comparable score, suggesting a closely related interaction profile.\u003c/p\u003e \u003cp\u003eThe docked co-crystal ligand reproduced the key crystallographic interactions reported for the 6P6G structure [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. Two direct hydrogen bonds were observed, with \u003cem\u003eThr184\u003c/em\u003e interacting with the oxygen atom of the formamide group and \u003cem\u003eCys186\u003c/em\u003e interacting with the oxygen atom of the sulfonyl moiety. A protonated NH group formed a salt-bridge interaction with \u003cem\u003eAsp241\u003c/em\u003e at a distance of 3.96 \u0026Aring;, consistent with the acidic recognition motif described in the crystal structure (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). Several water-mediated interactions further stabilised the complex, including bridging contacts involving \u003cem\u003eLys297\u003c/em\u003e and the sulfonyl oxygen. 
Aromatic stabilisation was provided by a π-π stacking interaction between \u003cem\u003ePhe183\u003c/em\u003e and the isoxazole ring. Additional hydrophobic contacts were observed with residues lining the binding tunnel, including \u003cem\u003eTyr257, Tyr239, Met190, Leu240, Ile237, Cys238, Phe216\u003c/em\u003e, and \u003cem\u003eVal368\u003c/em\u003e. This interaction pattern closely matches the crystallographic binding mode reported for the reference ligand (Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). CHEMBL4472528 adopted a binding pose that strongly overlapped with the co-crystal ligand and preserved the major anchoring interactions within the active site. Three direct hydrogen bonds were identified, involving \u003cem\u003eTyr239\u003c/em\u003e and \u003cem\u003eThr184\u003c/em\u003e with the formamide oxygen, and \u003cem\u003eTyr257\u003c/em\u003e with the isoxazole oxygen. Water-mediated hydrogen bonds connected the ligand to \u003cem\u003eLys297\u003c/em\u003e and \u003cem\u003eSer182\u003c/em\u003e, while the acidic residues \u003cem\u003eGlu192, Asp241\u003c/em\u003e, and \u003cem\u003eGlu294\u003c/em\u003e formed a stabilising electrostatic environment around the protonated regions of the ligand. The binding pocket was further stabilised by extensive hydrophobic interactions with residues such as \u003cem\u003eCys333, Ile293, Leu290, Met190, Leu240, Cys238, Ile237, Phe216\u003c/em\u003e, and \u003cem\u003eVal368\u003c/em\u003e (Figs.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e5\u003c/span\u003e). 
The preservation of \u003cem\u003eThr184\u003c/em\u003e anchoring and proximity to the acidic region formed by \u003cem\u003eGlu192\u003c/em\u003e and \u003cem\u003eAsp241\u003c/em\u003e indicates that CHEMBL4472528 engages the same critical residues as the crystallographic ligand.\u003c/p\u003e \u003cp\u003eCSSS06454646810 also occupied the validated active site and maintained essential interactions, although with a reduced number of direct polar contacts. Two hydrogen bonds were observed, involving \u003cem\u003eThr184\u003c/em\u003e with the NH group of the isoxazole ring and \u003cem\u003eGln252\u003c/em\u003e with the NH group of the pyrimidine moiety. A π-π stacking interaction between \u003cem\u003ePhe183\u003c/em\u003e and the isoxazole ring contributed to aromatic stabilisation. A water-mediated interaction connected \u003cem\u003eAsp241\u003c/em\u003e with the formamide NH, and \u003cem\u003eGlu192\u003c/em\u003e provided a stabilising negative electrostatic environment. Hydrophobic interactions were observed with residues including \u003cem\u003eMet190, Val368, Tyr257, Tyr239, Leu240, Cys238, Ile237, Cys180, Ile214\u003c/em\u003e, and \u003cem\u003eCys186\u003c/em\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e). In this complex, the cofactor SAH509 was positioned within 5 \u0026Aring; of the binding pocket, suggesting potential proximity effects without direct interference with ligand binding. Thus, the docking results indicate that all three ligands engage the same core binding residues identified in the crystallographic structure of PDB 6P6G. 
The consistent involvement of \u003cem\u003eThr184, Asp241, Glu192, Phe183\u003c/em\u003e, and the surrounding hydrophobic tunnel indicates that the predicted binding modes align with experimentally validated interaction patterns [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eMolecular Dynamics Simulation and Binding Energy Analysis\u003c/h2\u003e \u003cp\u003eThe reference co-crystal complex maintained a stable protein backbone over the 250 ns simulation. The protein heavy-atom RMSD remained centred at 2.65\u0026thinsp;\u0026plusmn;\u0026thinsp;0.25 \u0026Aring;, with a narrow interquartile range (Q1-Q3: 2.575\u0026ndash;2.792 \u0026Aring;), indicating limited global structural drift. The ligand RMSD indicated moderate variability (2.59\u0026thinsp;\u0026plusmn;\u0026thinsp;0.93 \u0026Aring;; median 2.97 \u0026Aring;). The trajectory showed no abrupt displacement of the ligand but rather a gradual positional shift after the midpoint of the simulation. Despite this adjustment, the ligand remained in the binding pocket throughout the simulation. Ligand physicochemical descriptors were stable, with PSA values of 172.8\u0026thinsp;\u0026plusmn;\u0026thinsp;4.6 \u0026Aring;\u0026sup2; and MolSA values of 470.95\u0026thinsp;\u0026plusmn;\u0026thinsp;5.92 \u0026Aring;\u0026sup2;, indicating no significant changes in solvent exposure or molecular conformation during the trajectory (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eCHEMBL4472528 was at least as stable as the reference system and, judged by ligand RMSD, slightly more so. 
Protein RMSD remained in a steady range (2.84\u0026thinsp;\u0026plusmn;\u0026thinsp;0.47 \u0026Aring;; median 3.03 \u0026Aring;), whereas the ligand exhibited lower and tighter fluctuations than the co-crystal ligand (2.36\u0026thinsp;\u0026plusmn;\u0026thinsp;0.65 \u0026Aring;; median 2.54 \u0026Aring;; maximum 3.66 \u0026Aring;). The trajectory reflects an initial accommodation phase followed by sustained retention of the ligand within the binding pocket. Both the PSA (172.6\u0026thinsp;\u0026plusmn;\u0026thinsp;4.4 \u0026Aring;\u0026sup2;) and radius of gyration (6.39\u0026thinsp;\u0026plusmn;\u0026thinsp;0.36 \u0026Aring;) were stable, indicating that the molecule maintained its compact shape and a consistent polar orientation within the cavity (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). In contrast, CSSS06454646810 followed a markedly different dynamic path. Protein RMSD values were higher on average (3.27\u0026thinsp;\u0026plusmn;\u0026thinsp;0.39 \u0026Aring;; median 3.34 \u0026Aring;), and ligand RMSD relative to the protein was substantially elevated (5.54\u0026thinsp;\u0026plusmn;\u0026thinsp;1.34 \u0026Aring;; median 5.99 \u0026Aring;; maximum 7.25 \u0026Aring;). The trajectory shows that the ligand moved substantially early in the simulation and subsequently sampled a much larger range of positions within the binding pocket (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e6\u003c/span\u003e). The average RMSF values (1.47\u0026thinsp;\u0026plusmn;\u0026thinsp;0.62 \u0026Aring;) indicate that residue-level fluctuations of the protein remained comparable to those of the other systems.\u003c/p\u003e \u003cp\u003eThroughout the simulation, the co-crystal ligand maintained a common network of polar, hydrophobic, and water-mediated interactions. 
A direct hydrogen bond persisted between the oxygen atom of the formamide group and the polar residue \u003cem\u003eThr184\u003c/em\u003e, in agreement with the crystallographic binding mode. A conserved water molecule mediated the interaction between \u003cem\u003eCys238\u003c/em\u003e and the NH group of the isoxazole moiety, and a second water molecule mediated the interaction between \u003cem\u003eIle214\u003c/em\u003e and the oxygen atom of the sulfonyl group. The NH-substituted benzene ring was part of a water-mediated hydrogen-bonding network involving the negatively charged residues \u003cem\u003eAsp241\u003c/em\u003e and \u003cem\u003eGlu192\u003c/em\u003e, which contributed local electrostatic stabilisation.\u003c/p\u003e \u003cp\u003eAromatic and hydrophobic contacts were preserved, with \u003cem\u003ePhe183\u003c/em\u003e forming a π-π stacking interaction with the isoxazole ring and \u003cem\u003eTyr239\u003c/em\u003e and \u003cem\u003eTyr257\u003c/em\u003e providing non-bonded hydrophobic interactions. Secondary structure analysis confirmed preservation of the protein fold, with 39.09% α-helix, 11.37% β-strand, and a total SSE content of 50.46%. CHEMBL4472528 exhibited a well-anchored binding mode supported by multiple direct and water-mediated interactions. Three direct hydrogen bonds were consistently observed, involving \u003cem\u003eThr184\u003c/em\u003e with the formamide oxygen, \u003cem\u003eCys238\u003c/em\u003e with the formamide NH group, and the positively charged residue \u003cem\u003eLys329\u003c/em\u003e with the NH-substituted benzene ring. \u003cem\u003eLys329\u003c/em\u003e additionally formed a stabilising π-cation interaction with the benzene ring. Water-mediated interactions further reinforced binding stability, including a water bridge linking \u003cem\u003eMet242\u003c/em\u003e to the sulfonyl oxygen and another water molecule connecting \u003cem\u003eAsp332\u003c/em\u003e to the NH-substituted benzene ring. 
A conserved water molecule also linked \u003cem\u003eCys238\u003c/em\u003e to the NH group of the isoxazole moiety. Hydrophobic contacts involving \u003cem\u003ePhe183, Tyr239\u003c/em\u003e, and \u003cem\u003eTyr257\u003c/em\u003e were maintained throughout the simulation. The protein secondary structure remained stable, with 39.91% α-helix, 11.19% β-strand, and a total SSE of 51.10%, closely matching the reference system. Principal component analysis (PCA) shows that the dominant protein motions are captured by the first few principal components across all complexes, with ligand-dependent differences in conformational sampling. Compared with the co-crystal ligand, the CHEMBL4472528 and CSSS06454646810 complexes display altered clustering patterns, indicating modulation of collective motions upon ligand binding (Fig.\u0026nbsp;\u003cspan refid=\"Fig7\" class=\"InternalRef\"\u003e7\u003c/span\u003e). Dynamic cross-correlation matrix (DCCM) analysis reveals broadly conserved correlation patterns, with localized changes in correlated and anti-correlated residue motions in the hit compound complexes, suggesting ligand-specific effects on protein dynamics without global destabilization.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eIn contrast, CSSS06454646810 displayed a more limited and less persistent interaction network. A π-π stacking interaction between the benzene ring and \u003cem\u003ePhe183\u003c/em\u003e was retained and represented the primary aromatic stabilisation. Water-mediated hydrogen bonding involved \u003cem\u003eIle214\u003c/em\u003e, where a water molecule bridged the formamide oxygen, and \u003cem\u003eThr184\u003c/em\u003e, which interacted with the same water molecule that hydrogen-bonded to the NH group of the formamide. 
Non-bonded hydrophobic contacts with \u003cem\u003eTyr239\u003c/em\u003e and \u003cem\u003eTyr257\u003c/em\u003e were present but fewer and less persistent than those observed for the co-crystal ligand and CHEMBL4472528. Secondary structure analysis showed a modest reduction in overall SSE content (37.18% α-helix, 11.23% β-strand, total SSE 48.41%), consistent with the increased ligand mobility observed in RMSD analyses.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eMM-GBSA calculations were carried out at the initial docking pose (0 ns) and after 250 ns of MD simulation to evaluate binding energetics over time (Table S22). CHEMBL4472528 showed the most favourable binding free energies among the studied ligands, with ΔG\u003csub\u003ebind\u003c/sub\u003e values of \u0026minus;\u0026thinsp;103.57 kcal/mol at 0 ns and \u0026minus;\u0026thinsp;96.27 kcal/mol at 250 ns. Binding was dominated by strong van der Waals (\u0026minus;\u0026thinsp;72.12 to \u0026minus;\u0026thinsp;69.39 kcal/mol) and lipophilic contributions (\u0026minus;\u0026thinsp;50.53 to \u0026minus;\u0026thinsp;46.99 kcal/mol), indicating stable hydrophobic packing within the binding pocket. An increase in electrostatic contributions at 250 ns suggests adaptive optimisation of polar interactions during equilibration. The co-crystal ligand exhibited strong initial binding (ΔG\u003csub\u003ebind\u003c/sub\u003e = \u0026minus;\u0026thinsp;91.70 kcal/mol at 0 ns), supported by favourable Coulombic and van der Waals interactions. By the 250 ns mark, the binding free energy had weakened to \u0026minus;\u0026thinsp;51.75 kcal/mol, mostly because of a decrease in van der Waals contributions and an increase in solvation penalties. This change in energy is consistent with the increased mobility of the ligand in the MD trajectory. In contrast, CSSS06454646810 initially showed a less favourable binding energy of \u0026minus;\u0026thinsp;35.07 kcal/mol. However, the estimated binding energy became considerably more favourable during the simulation, reaching \u0026minus;\u0026thinsp;78.87 kcal/mol by the 250 ns mark. 
This was probably because the molecule settled into the hydrophobic pocket, which enhanced the van der Waals and lipophilic contributions. Even so, its binding energy remained less favourable than that of CHEMBL4472528, in line with its higher RMSD and its fewer, less persistent stabilising interactions.\u003c/p\u003e \u003cp\u003e \u003cb\u003eNetwork and Pathway Context of\u003c/b\u003e \u003cb\u003eSMYD3\u003c/b\u003e \u003cb\u003eTargeting\u003c/b\u003e\u003c/p\u003e \u003cp\u003eThe structural results were placed in a wider biological context by analysing the interaction network and pathway associations of \u003cem\u003eSMYD3\u003c/em\u003e. Protein-protein interaction analysis revealed a compact network centred on chromatin-associated proteins, with direct connections to histone H3 variants such as \u003cem\u003eH3C12\u003c/em\u003e and \u003cem\u003eH3C13\u003c/em\u003e. The network also included functionally related chromatin regulators, including \u003cem\u003eKMT2E\u003c/em\u003e, indicating potential coordination among histone methylation systems [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e]. Molecular chaperones \u003cem\u003eHSP90AA1\u003c/em\u003e and \u003cem\u003eHSP90AB1\u003c/em\u003e were present, reflecting their known role in stabilising epigenetic enzymes and supporting conformational integrity [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e]. 
Transcription-associated proteins such as \u003cem\u003eESR1\u003c/em\u003e and \u003cem\u003eHELZ\u003c/em\u003e were also connected, reinforcing the involvement of \u003cem\u003eSMYD3\u003c/em\u003e in transcriptional regulation rather than broad signalling pathways [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e, \u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e] (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e and Table S23).\u003c/p\u003e \u003cp\u003eFunctional enrichment analysis supported these network observations. Gene Ontology biological process analysis highlighted chromatin organisation, nucleosome assembly, and epigenetic regulation of transcription as the most consistently represented functional categories within the network. Reactome and CTD pathway annotations showed concordant enrichment for chromatin-modifying enzymes and histone methylation processes, while pathways unrelated to nuclear or epigenetic regulation were weakly represented or absent. Disease association analysis using curated CTD annotations indicated strong links between \u003cem\u003eSMYD3\u003c/em\u003e and cancer-related conditions, including liver, breast, and lung malignancies. These associations are consistent with reported roles of \u003cem\u003eSMYD3\u003c/em\u003e in oncogenic transcriptional programs and epigenetic dysregulation. The TF subnetwork further shows that \u003cem\u003eSMYD3\u003c/em\u003e is linked to both activating and inhibitory transcriptional regulators, indicating its involvement in coordinating diverse gene regulatory programs rather than acting through a single pathway (Fig.\u0026nbsp;\u003cspan refid=\"Fig8\" class=\"InternalRef\"\u003e8\u003c/span\u003e). 
Within this framework, the structure-based prioritisation of CHEMBL4472528 provides a mechanistic context, as stable binding to \u003cem\u003eSMYD3\u003c/em\u003e would be expected to influence epigenetic regulation rather than peripheral signalling networks. Thus, the network and pathway analysis places \u003cem\u003eSMYD3\u003c/em\u003e within an epigenetic and transcription-related functional context. These findings align with the docking and molecular dynamics results, indicating that the observed ligand-protein interactions are consistent with the known biological role of \u003cem\u003eSMYD3\u003c/em\u003e and remain within the limits of computational interpretation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eThis study presents a coherent computational framework integrating QSAR modelling, structure-based docking, molecular dynamics simulations, and network-level contextualisation to prioritise small-molecule inhibitors of \u003cem\u003eSMYD3\u003c/em\u003e. The approach was designed to balance predictive accuracy, chemical interpretability, and biological relevance. Among the QSAR models evaluated, the MACCS fingerprint combined with Random Forest showed the most reliable external predictive performance. Its superiority over the PubChem, CDK, and Substructure representations suggests that the activity landscape of \u003cem\u003eSMYD3\u003c/em\u003e inhibitors is better captured by compact structural keys than by the alternative representations tested. Consistent behaviour across multiple random splits, together with favourable agreement measures on an independent test set, suggests that the chosen model does not rely on chance correlations or artefacts of the training dataset. 
Applicability domain and Y-randomization analyses further support this interpretation, indicating that the predictive performance reflects genuine structure-activity relationships rather than over-fitting of the model.\u003c/p\u003e \u003cp\u003eSHAP interpretability analysis clarified how the model converts chemical structures into predictions of bioactivity. A few MACCS keys, in particular those representing carbon-phosphorus connectivity, heterocyclic motifs, and aliphatic connections, dominated the feature importance. The direction and magnitude of these SHAP contributions were consistent for both active and inactive compounds, which is a good indication that the model is internally coherent. Local explanations for specific molecules showed why certain predictions were very accurate while others had larger residuals, shedding light on the model's decision-making process. This clarity helps to build confidence in the use of the model for hit prioritization. External screening within the applicability domain then identified a list of candidates with predicted activities close to the training median. Since there were no high-leverage predictions amongst the top hits, the prioritization remained within the confines of known chemical space rather than relying on risky structural extrapolation. This conservative strategy follows best practice in virtual screening when experimental validation is unavailable.\u003c/p\u003e \u003cp\u003eWe then assessed the QSAR-derived hits by structure-based analysis using the \u003cem\u003eSMYD3\u003c/em\u003e crystal structure. Docking results showed that CHEMBL4472528 adopts a binding pose closely aligned with the co-crystal ligand, preserving key anchoring interactions involving \u003cem\u003eThr184, Asp241, Glu192\u003c/em\u003e, and \u003cem\u003ePhe183\u003c/em\u003e. 
These residues form the core recognition environment of the \u003cem\u003eSMYD3\u003c/em\u003e active site and have been highlighted in prior structural studies of SET-domain methyltransferases [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]. The consistency between predicted binding modes and experimentally observed interaction patterns supports the chemical plausibility of the prioritised scaffold. Molecular dynamics (MD) simulations were employed to further examine binding stability under explicit solvent conditions. Throughout the 250 ns trajectory, CHEMBL4472528 remained stable, with low RMSD values and persistent polar, aromatic, and water-mediated interactions. Its interaction network was similar to that of the reference ligand, and MM-GBSA analysis revealed favorable binding energetics with van der Waals and lipophilic contributions as the main driving forces. In contrast, CSSS06454646810 showed higher flexibility and less persistent interactions, which is consistent with its weaker energetic profile. These results justify focusing the detailed binding-mode discussion on CHEMBL4472528 while acknowledging other possible chemotypes.\u003c/p\u003e \u003cp\u003eNetwork and pathway analysis indicated that \u003cem\u003eSMYD3\u003c/em\u003e sits within a tightly focused context of epigenetic and transcriptional regulation. Protein-protein interaction analysis revealed significant associations with histone H3 variants, chromatin-modifying enzymes, and molecular chaperones, which is in line with the known role of \u003cem\u003eSMYD3\u003c/em\u003e in chromatin organization [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e, \u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e]. 
Pathway enrichment using Gene Ontology, Reactome, and CTD annotations supported these results, with a strong emphasis on processes involved in chromatin modification and cancer-associated pathways [\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e]. Notably, the network was restricted to these areas and did not overlap with unrelated signaling pathways. This specificity implies that our structural results sit within a defined biological context rather than arising from diverse, non-specific associations. Thus, the convergence of QSAR performance, interpretable feature contributions, structure-based validation, dynamic stability, and network-level context supports the robustness of the proposed computational workflow. The results position CHEMBL4472528 as a structurally and biologically plausible \u003cem\u003eSMYD3\u003c/em\u003e inhibitor candidate, while demonstrating that integrative computational strategies can provide meaningful mechanistic insight even in the absence of experimental validation. This study, therefore, contributes a reproducible and interpretable framework for epigenetic target exploration using data-driven and structure-informed approaches.\u003c/p\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eLimitations and Future Perspectives\u003c/h2\u003e \u003cp\u003eThis study relies solely on computational analysis. The predicted inhibitory activity of the prioritised compounds has not been validated using experimental assays. Although the QSAR models were carefully validated using external tests, applicability domain analysis, and Y-randomization, their reliability depends on the chemical space covered by the ChEMBL dataset. Docking and molecular dynamics simulations describe binding stability and interaction patterns but do not directly measure enzymatic inhibition or cellular response. 
MM-GBSA energies provide relative trends and should not be interpreted as absolute binding affinities. Future work may focus on experimental testing of the top-ranked compounds, particularly CHEMBL4472528 and selected external hits, using \u003cem\u003eSMYD3\u003c/em\u003e inhibition assays. Integrating transcriptomic or epigenetic data following \u003cem\u003eSMYD3\u003c/em\u003e inhibition may also help clarify downstream biological effects and support translational relevance.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study presents an integrated computational strategy to prioritise small-molecule inhibitors of \u003cem\u003eSMYD3\u003c/em\u003e by combining QSAR modelling with structure-based and network-level analyses. The MACCS fingerprint-based Random Forest model showed stable performance across validation tests and provided interpretable structure-activity relationships, supporting its use as a reliable screening and prioritisation tool rather than a purely statistical model. Structure-based analysis confirmed that the prioritised compound binds within the experimentally characterised \u003cem\u003eSMYD3\u003c/em\u003e active site and maintains key residue interactions reported in crystallographic studies. Molecular dynamics simulations further supported stable ligand retention and favourable binding energetics over the simulation period, distinguishing CHEMBL4472528 from other candidates. These observations are consistent with the interaction patterns and energetic trends observed for the reference complex. Network and pathway analysis placed \u003cem\u003eSMYD3\u003c/em\u003e within a focused epigenetic and transcription-related context, linking the structural findings to its known biological role. This systems-level view supports the relevance of targeting \u003cem\u003eSMYD3\u003c/em\u003e without extending beyond the scope of computational interpretation. 
Thus, the results support CHEMBL4472528 as a structurally stable and biologically plausible \u003cem\u003eSMYD3\u003c/em\u003e inhibitor candidate and demonstrate the utility of combining validated QSAR modelling with molecular docking, molecular dynamics, and network analysis for rational inhibitor prioritisation.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e \u003ch2\u003eConflicts of Interest:\u003c/h2\u003e \u003cp\u003eThe authors declare no conflicts of interest.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eAbdullah R. Alzahrani: Conceptualization, methodology, investigation, supervision, and review of the manuscript; Zia Ur Rehman: Methodology, investigation, and manuscript drafting; Talha Jawaid: Investigation, formal analysis, and manuscript drafting; Abida Khan: Conceptualization, investigation, manuscript writing, review and editing, and corresponding author responsibilities. All authors have read and approved the final manuscript and agree to be accountable for the integrity and accuracy of the work.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e \u003cp\u003eThe authors extend their appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA for funding this research work through project number \u0026ldquo;NBU-FFR-2026\u0026ndash;2042\u0026ndash;03\u0026rdquo;.\u003c/p\u003e\u003ch2\u003eData Availability Statement:\u003c/h2\u003e \u003cp\u003eThe original contributions presented in this study are included in the article and its supplementary file.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eMicallef I, Baron B (2025) Therapeutic Targeting of Protein Lysine and Arginine Methyltransferases: Principles and Strategies for Inhibitor Design. Int J Mol Sci 26:9038. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/IJMS26189038\u003c/span\u003e\u003cspan address=\"10.3390/IJMS26189038\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWang Z, Chen X, Zhu C et al (2025) Direct lysine dimethylation of IRF3 by the methyltransferase SMYD3 attenuates antiviral innate immunity. Proc Natl Acad Sci USA 122:e2320644122. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1073/PNAS.2320644122\u003c/span\u003e\u003cspan address=\"10.1073/PNAS.2320644122\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBernard BJ, Nigam N, Burkitt K, Saloura V (2021) SMYD3: a regulator of epigenetic and signaling pathways in cancer. Clin Epigenetics 13:45. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/S13148-021-01021-9\u003c/span\u003e\u003cspan address=\"10.1186/S13148-021-01021-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNigam N, Bernard B, Sevilla S et al (2023) SMYD3 represses tumor-intrinsic interferon response in HPV-negative squamous cell carcinoma of the head and neck. Cell Rep 42:112823. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.CELREP.2023.112823\u003c/span\u003e\u003cspan address=\"10.1016/J.CELREP.2023.112823\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMazur PK, Gozani O, Sage J, Reynoird N (2016) Novel insights into the oncogenic function of the SMYD3 lysine methyltransferase. Transl Cancer Res 5:330\u0026ndash;333. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.21037/TCR.2016.06.26\u003c/span\u003e\u003cspan address=\"10.21037/TCR.2016.06.26\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao L, Wang Z, Cheng P et al (2025) SMYD3\u0026ndash;CDCP1 Axis Drives EMT and CAF Activation in Colorectal Cancer and Is Targetable for Oxaliplatin Sensitization. Biomedicines 13:2737. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/BIOMEDICINES13112737\u003c/span\u003e\u003cspan address=\"10.3390/BIOMEDICINES13112737\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu Z, Zhao X, Zang M et al (2025) SMYD3 Promotes Immune Evasion in Clear Cell Renal Cell Carcinoma via SREBP1-Mediated Transactivation of CD47. Adv Sci (Weinheim, Baden-Wurttemberg, Germany) 12:e04200. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/ADVS.202404200\u003c/span\u003e\u003cspan address=\"10.1002/ADVS.202404200\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYang Z, Liu F, Li Z et al (2023) Histone lysine methyltransferase SMYD3 promotes oral squamous cell carcinoma tumorigenesis via H3K4me3-mediated HMGA2 transcription. Clin Epigenetics 15:92. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/S13148-023-01506-9\u003c/span\u003e\u003cspan address=\"10.1186/S13148-023-01506-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYu X, Zhao H, Wang R et al (2024) Cancer epigenetics: from laboratory studies and clinical trials to precision medicine. 
Cell Death Discov 10:28. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/s41420-024-01803-z\u003c/span\u003e\u003cspan address=\"10.1038/s41420-024-01803-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJi Y, Chen Z, Cai J (2025) Roles and mechanisms of histone methylation in vascular aging and related diseases. Clin Epigenetics 17:35. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1186/S13148-025-01842-Y\u003c/span\u003e\u003cspan address=\"10.1186/S13148-025-01842-Y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMo L, Deng M, Adhav R et al (2025) Oncogenic activation of SMYD3-SHCBP1 promotes breast cancer development and is coupled with resistance to immune therapy. Cell Death Dis 16:220. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/S41419-025-07570-8\u003c/span\u003e\u003cspan address=\"10.1038/S41419-025-07570-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFasano C, Lepore Signorile M, De Marco K et al (2022) Identifying novel SMYD3 interactors on the trail of cancer hallmarks. Comput Struct Biotechnol J 20:1860\u0026ndash;1875. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.CSBJ.2022.03.037\u003c/span\u003e\u003cspan address=\"10.1016/J.CSBJ.2022.03.037\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYap CW (2011) PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints. J Comput Chem 32:1466\u0026ndash;1474. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/JCC.21707\u003c/span\u003e\u003cspan address=\"10.1002/JCC.21707\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQurat Ul Ain S, Islam Rather KU (2025) Integrated statistical modeling and machine learning techniques with SHAP for epidemiological data analysis. Ann Epidemiol 108:85\u0026ndash;91. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.ANNEPIDEM.2025.06.012\u003c/span\u003e\u003cspan address=\"10.1016/J.ANNEPIDEM.2025.06.012\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDu Y, Liu X, Shah N et al (2022) ChemSpacE: Interpretable and Interactive Chemical Space Exploration. ChemRxiv. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.26434/CHEMRXIV-2022-X49MH-V3\u003c/span\u003e\u003cspan address=\"10.26434/CHEMRXIV-2022-X49MH-V3\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTingle BI, Tang KG, Castanon M et al (2023) ZINC-22\u0026mdash;A Free Multi-Billion-Scale Database of Tangible Compounds for Ligand Discovery. J Chem Inf Model 63:1166. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/ACS.JCIM.2C01253\u003c/span\u003e\u003cspan address=\"10.1021/ACS.JCIM.2C01253\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSu DS, Qu J, Schulz M et al (2019) Discovery of Isoxazole Amides as Potent and Selective SMYD3 Inhibitors. ACS Med Chem Lett 11:133\u0026ndash;140. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/ACSMEDCHEMLETT.9B00493\u003c/span\u003e\u003cspan address=\"10.1021/ACSMEDCHEMLETT.9B00493\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMadhavi Sastry G, Adzhigirey M, Day T et al (2013) Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. J Comput Aided Mol Des 27:221\u0026ndash;234. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1007/S10822-013-9644-8\u003c/span\u003e\u003cspan address=\"10.1007/S10822-013-9644-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFriesner RA, Banks JL, Murphy RB et al (2004) Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J Med Chem 47:1739\u0026ndash;1749. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/JM0306430\u003c/span\u003e\u003cspan address=\"10.1021/JM0306430\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBowers KJ, Chow E, Xu H et al (2006) Scalable algorithms for molecular dynamics simulations on commodity clusters. Proc 2006 ACM/IEEE Conf Supercomput SC\u0026rsquo;06. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1145/1188455.1188544\u003c/span\u003e\u003cspan address=\"10.1145/1188455.1188544\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLu C, Wu C, Ghoreishi D et al (2021) OPLS4: Improving Force Field Accuracy on Challenging Regimes of Chemical Space. 
J Chem Theory Comput 17:4291\u0026ndash;4300. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1021/ACS.JCTC.1C00302\u003c/span\u003e\u003cspan address=\"10.1021/ACS.JCTC.1C00302\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eManaithiya A, Bhowmik R, Acharjee S et al (2024) Elucidating molecular mechanism and chemical space of chalcones through biological networks and machine learning approaches. Comput Struct Biotechnol J 23:2811\u0026ndash;2836. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1016/J.CSBJ.2024.07.006\u003c/span\u003e\u003cspan address=\"10.1016/J.CSBJ.2024.07.006\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSzklarczyk D, Nastou K, Koutrouli M et al (2025) The STRING database in 2025: protein networks with directionality of regulation. Nucleic Acids Res 53:D730\u0026ndash;D737. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/NAR/GKAE1113\u003c/span\u003e\u003cspan address=\"10.1093/NAR/GKAE1113\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDavis AP, Wiegers TC, Sciaky D et al (2025) Comparative Toxicogenomics Database\u0026rsquo;s 20th anniversary: update 2025. Nucleic Acids Res 53:D1328\u0026ndash;D1334. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/NAR/GKAE883\u003c/span\u003e\u003cspan address=\"10.1093/NAR/GKAE883\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBochyńska A, L\u0026uuml;scher-Firzlaff J, L\u0026uuml;scher B (2018) Modes of Interaction of KMT2 Histone H3 Lysine 4 Methyltransferase/COMPASS Complexes with Chromatin. Cells 7:17. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.3390/CELLS7030017\u003c/span\u003e\u003cspan address=\"10.3390/CELLS7030017\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbu-Farha M, Lanouette S, Elisma F et al (2011) Proteomic analyses of the SMYD family interactomes identify HSP90 as a novel target for SMYD2. J Mol Cell Biol 3:301\u0026ndash;308. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/JMCB/MJR025\u003c/span\u003e\u003cspan address=\"10.1093/JMCB/MJR025\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHamamoto R, Furukawa Y, Morita M et al (2004) SMYD3 encodes a histone methyltransferase involved in the proliferation of cancer cells. Nat Cell Biol 6:731\u0026ndash;740. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/ncb1151\u003c/span\u003e\u003cspan address=\"10.1038/ncb1151\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSilva FP, Hamamoto R, Kunizaki M et al (2007) Enhanced methyltransferase activity of SMYD3 by the cleavage of its N-terminal region in human cancer cells. Oncogene 27:2686\u0026ndash;2692. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1038/sj.onc.1210929\u003c/span\u003e\u003cspan address=\"10.1038/sj.onc.1210929\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDing Q, Cai J, Jin L et al (2024) A novel small molecule ZYZ384 targeting SMYD3 for hepatocellular carcinoma via reducing H3K4 trimethylation of the Rac1 promoter. MedComm 5:e711. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1002/MCO2.711\u003c/span\u003e\u003cspan address=\"10.1002/MCO2.711\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eForeman KW, Brown M, Park F et al (2011) Structural and Functional Profiling of the Human Histone Methyltransferase SMYD3. PLoS ONE 6:e22290. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1371/JOURNAL.PONE.0022290\u003c/span\u003e\u003cspan address=\"10.1371/JOURNAL.PONE.0022290\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu W, Liu N, Qiao Q et al (2016) Structural basis for substrate preference of SMYD3, a SET domain-containing protein lysine methyltransferase. J Biol Chem 291:9173\u0026ndash;9180. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1074/jbc.M115.709832\u003c/span\u003e\u003cspan address=\"10.1074/jbc.M115.709832\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"molecular-diversity","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"modi","sideBox":"Learn more about [Molecular Diversity](http://link.springer.com/journal/11030)","snPcode":"11030","submissionUrl":"https://submission.nature.com/new-submission/11030/3","title":"Molecular Diversity","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"SMYD3, Cancer, Machine learning QSAR, Prediction model, Molecular Modelling, Network Biology","lastPublishedDoi":"10.21203/rs.3.rs-8662415/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8662415/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003e \u003cem\u003eSMYD3\u003c/em\u003e is a lysine methyltransferase involved in epigenetic regulation and oncogenic transcription, making it an attractive yet challenging therapeutic target. 
This study presents an integrated computational workflow combining machine learning based quantitative structure-activity relationship (QSAR) modelling, external bioactivity prediction, molecular docking, molecular dynamics (MD) simulations, and network analysis to prioritize potential \u003cem\u003eSMYD3\u003c/em\u003e inhibitors. ML-QSAR models were constructed using multiple molecular descriptor representations and regression algorithms. A MACCS fingerprint-based Random Forest model showed the most reliable external predictivity, supported by cross-validation, applicability domain assessment, and Y-randomization analysis. Feature interpretability using SHAP highlighted a small set of chemically meaningful structural patterns that consistently influenced activity prediction. The validated model was then applied to an external compound library, and bioactivity was predicted only for compounds lying within the defined applicability domain. This screening enabled the prioritization of in-domain candidates with moderate predicted potency and acceptable structural coverage relative to the training space. Structure-based evaluation using the crystallographic \u003cem\u003eSMYD3\u003c/em\u003e structure demonstrated that selected compounds bind within the experimentally validated active site and engage key residues observed in the co-crystal complex. Extended 250 ns MD simulations indicated that CHEMBL4472528 maintained stable binding, persistent polar and hydrophobic interactions, and favorable binding free energies compared with both the co-crystal ligand and other screened candidates. Network and pathway analysis further placed \u003cem\u003eSMYD3\u003c/em\u003e within a focused chromatin-associated and transcriptional regulatory context, supporting the biological relevance of the target. 
This work provides a reproducible computational framework for \u003cem\u003eSMYD3\u003c/em\u003e inhibitor prioritization and highlights CHEMBL4472528 as a promising scaffold for further investigation.\u003c/p\u003e","manuscriptTitle":"Predictive Bioactivity Modeling and Structural Binding Analysis for the Identification of Potential SMYD3 Modulators","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-01-28 06:36:16","doi":"10.21203/rs.3.rs-8662415/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2026-03-02T11:16:41+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-03-01T03:44:23+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-02-22T20:21:58+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"261234179319519088388393126310923667433","date":"2026-02-13T10:06:08+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"81862786202667181316119748649707490906","date":"2026-02-09T09:48:26+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2026-01-22T21:16:23+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-01-22T14:13:25+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-01-22T11:00:26+00:00","index":"","fulltext":""},{"type":"submitted","content":"Molecular Diversity","date":"2026-01-21T17:07:19+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"molecular-diversity","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"modi","sideBox":"Learn more about [Molecular Diversity](http://link.springer.com/journal/11030)","snPcode":"11030","submissionUrl":"https://submission.nature.com/new-submission/11030/3","title":"Molecular 
Diversity","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"de84b4ff-e13e-4f37-81fb-213de8e89c6c","owner":[],"postedDate":"January 28th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2026-04-13T16:06:29+00:00","versionOfRecord":{"articleIdentity":"rs-8662415","link":"https://doi.org/10.1007/s11030-026-11533-2","journal":{"identity":"molecular-diversity","isVorOnly":false,"title":"Molecular Diversity"},"publishedOn":"2026-04-10 15:58:35","publishedOnDateReadable":"April 10th, 2026"},"versionCreatedAt":"2026-01-28 06:36:16","video":"","vorDoi":"10.1007/s11030-026-11533-2","vorDoiUrl":"https://doi.org/10.1007/s11030-026-11533-2","workflowStages":[]},"version":"v1","identity":"rs-8662415","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8662415","identity":"rs-8662415","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
