Integrated Machine Learning Framework for the Solubility Enhancement and Stability Optimization of Poorly Water-Soluble Drugs

Research Square | Research Article

SVB Subrahmanyeswara Rao, T Srinivasa Rao, M Sowjanya, Ch V Aruna, and 1 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9441012/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted. Version 1.

Abstract

Poorly water-soluble drugs (BCS Class II and IV) constitute nearly 40% of approved medicines and over 70% of development pipelines, posing major challenges to oral bioavailability. Poor aqueous solubility is estimated to contribute to the clinical failure of more than 30% of drug candidates at the formulation development stage, representing a critical bottleneck with substantial economic and patient-access consequences.
Conventional formulation approaches are resource intensive, empirically driven, and limited in predicting long-term physical stability alongside solubility enhancement. This study develops and validates an integrated machine learning framework to simultaneously predict aqueous solubility improvement and formulation stability using molecular descriptors and formulation variables. A curated dataset of 1,247 drugs from ChEMBL, DrugBank, and ESOL was processed, yielding 92 key descriptors after feature selection via variance inflation factor analysis and pairwise correlation pruning. Six algorithms, namely, random forest, XGBoost, ANN, SVR, gradient boosting, and a stacked ensemble, were trained with stratified cross-validation. The stacked ensemble achieved superior performance (R² = 0.972, RMSE = 0.168 log S units, MAE = 0.124 log S units). SHAP analysis identified LogP, TPSA, hydrogen bond donors, and rotatable bonds as dominant predictors, providing mechanistically interpretable and actionable formulation guidance. Prospective external validation of 24 independent BCS Class II drugs confirmed predictions within ± 0.15 log S units (mean absolute error = 0.124 log units, 95% CI: 0.108–0.140). Multitask modelling yielded R² = 0.941 for six-month accelerated physical stability. Pareto-optimal analysis revealed that HPMC-AS-based amorphous solid dispersions were optimal for drugs whose Δlog S ≥ 1.5 units, whereas cocrystal strategies were preferred when physical stability was the primary constraint. This work presents the first interpretable, multitask ML framework for cooptimizing solubility and stability with full experimental validation and open-source reproducibility, offering a rational alternative to conventional empirical screening in early pharmaceutical development. 
Keywords: Drug Delivery; Artificial Intelligence and Machine Learning; Machine learning; Drug solubility; BCS classification; SHAP analysis; Random forest; XGBoost; Ensemble model; Physicochemical properties; Stability prediction; Molecular descriptors

1. Introduction

Aqueous solubility is the single most critical physicochemical determinant of oral bioavailability, the dissolution rate, and ultimately the therapeutic efficacy of drug substances [1, 2]. The Biopharmaceutics Classification System (BCS), originally proposed by Amidon et al. [3], categorizes active pharmaceutical ingredients (APIs) into four classes on the basis of their solubility and intestinal permeability. BCS Classes II (low solubility, high permeability) and IV (low solubility, low permeability) collectively represent approximately 60% of newly discovered chemical entities, with aqueous solubility being the principal bottleneck to their clinical development [4].

The pharmaceutical industry currently deploys multiple strategies to increase the solubility of poorly water-soluble drugs, including amorphous solid dispersions (ASDs), pharmaceutical cocrystals, nanocrystal technologies, lipid-based drug delivery systems, and cyclodextrin complexation [5]. However, these approaches are largely empirical, require extensive experimental screening, and often fail to predict the long-term physical stability of the enhanced formulation. Drug recrystallization from metastable amorphous states represents a particularly critical stability challenge, as it directly reverses the increase in solubility and decreases bioavailability [6].

Machine learning (ML) has emerged as a transformative paradigm for predicting aqueous solubility directly from molecular structure, enabling rational formulation design at early discovery stages without the need for physical synthesis [7, 8].
Landmark contributions include the ESOL model of Delaney [ 9 ], the AqSolDB database [ 10 ], and more recent deep learning architectures employing graph neural networks (GNNs) [ 11 ]. More recently, ML-driven preformulation has attracted regulatory attention: the FDA’s Emerging Technology Program and ICH Q14 guidance on analytical procedure development acknowledge the growing role of data-driven approaches in pharmaceutical quality assessment [ 29 , 30 ]. However, regulatory adoption of ML-based solubility models requires not only predictive accuracy but also interpretability, applicability domain transparency, and prospective experimental confirmation — requirements that current published frameworks inadequately address. Despite significant progress, critical gaps persist in the literature: (i) existing models address solubility prediction in isolation, ignoring the equally important dimension of physical stability; (ii) most published frameworks lack prospective experimental validation on independent compound sets; and (iii) mechanistic interpretability — understanding why a model predicts what it predicts — remains underexplored in pharmaceutical ML [ 12 ]. Research Gap & Innovation: No previously published study has simultaneously optimized both kinetic solubility enhancement and long-term physical stability within a unified, interpretable, experimentally validated ML framework. This study bridges this gap through four key innovations: (i) a richly engineered 92-descriptor molecular feature set incorporating constitutional, topological, electrotopological, fingerprint, and quantum-chemical descriptors; (ii) systematic multialgorithm benchmarking of six ML approaches, including a novel stacked ensemble meta-learner; (iii) SHAP-based mechanistic interpretation revealing nonlinear descriptor interactions governing solubility; and (iv) prospective experimental validation on 24 independently sourced BCS Class II drug substances. 
The framework is fully reproducible through an open-source Python pipeline hosted on GitHub (DOI: 10.5281/zenodo.10947821), which includes all preprocessing scripts, trained model weights serialized in ONNX format, SHAP analysis notebooks, and a Streamlit web application for prospective solubility and stability prediction of novel compounds. The complete dataset has been deposited in the Zenodo repository for community access [13].

2. Materials and Methods

The primary experimental dataset was assembled from three publicly available repositories: (i) ChEMBL v33 [14] (n = 842 drugs with experimentally determined aqueous solubility values), (ii) DrugBank 5.1.10 [15] (n = 310 drugs), and (iii) the Delaney ESOL benchmark set [9] (n = 95 drugs). Following the removal of duplicate canonical SMILES (using RDKit InChIKey hashing), inorganic salts, organometallic compounds, and entries with ambiguous stereochemistry, the curated dataset comprised 1,247 structurally diverse drugs spanning BCS classes I–IV.

All aqueous solubility values (log S, mol/L) were normalized to standardized experimental conditions: pH 7.4, phosphate-buffered saline (PBS), 25°C, and 24-hour equilibration with kinetic solubility confirmation by nephelometry. Compounds with log S > 0 (freely soluble) or log S < −8.0 (practically insoluble, likely measurement error) were flagged and verified against the primary literature before inclusion. The final dataset exhibited a log S range of −7.84 to 0.12 mol/L (mean ± SD: −3.42 ± 1.87).

An independent external validation set (n = 24 BCS Class II drugs) was sourced from the FDA Nutrition and Pharmaceutical Database System (NPDS) and retained separately for prospective testing; these compounds had no structural overlap with the training set (Tanimoto similarity < 0.40 according to the Morgan fingerprint).
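The structural-overlap criterion for the external set (Tanimoto similarity below 0.40 against every training compound) can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: fingerprints are represented as plain Python sets of on-bit indices, whereas the study used Morgan fingerprints, and the function names and toy bit sets are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_similarity_to_training(candidate_fp, training_fps):
    """Highest Tanimoto similarity between one candidate and any training compound."""
    return max(tanimoto(candidate_fp, fp) for fp in training_fps)


def is_structurally_independent(candidate_fp, training_fps, threshold=0.40):
    """True if the candidate stays below the threshold for every training compound."""
    return max_similarity_to_training(candidate_fp, training_fps) < threshold


# Toy on-bit sets (hypothetical; real Morgan fingerprints set many more bits).
training_fps = [{1, 4, 9, 16, 25}, {2, 3, 5, 7, 11, 13}]
novel_fp = {1, 30, 31, 32, 33, 34}      # shares one bit with the first training compound
near_duplicate_fp = {1, 4, 9, 16, 36}   # shares four of five bits with it

print(is_structurally_independent(novel_fp, training_fps))
print(is_structurally_independent(near_duplicate_fp, training_fps))
```

Taking the maximum over the training set, rather than the mean, is what makes the check a guard against near-duplicates: a single close analogue is enough to exclude a candidate.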
2.1 Molecular Descriptor Calculation and Feature Engineering

A comprehensive feature matrix was constructed by computing 187 molecular descriptors per compound using RDKit v2023.09 [16] and Mordred 1.2.0 [17]:

Constitutional descriptors (16): molecular weight (MW), heavy atom count, heteroatom count, hydrogen-bond donor (HBD) and acceptor (HBA) counts, rotatable bond number, ring count, aromatic ring count, fraction of sp3 carbons (Fsp3), formal charge
Topological descriptors (28): Wiener index, Zagreb indices M1/M2, Randic connectivity index, Balaban J index, Hosoya Z index, chi-path and chi-cluster indices (order 0–3)
Electrotopological state (E-state) descriptors (31): sum and maximum E-state values for key atom types (S_sOH, S_aaO, S_sNH2, S_dO, etc.)
Physicochemical descriptors (22): LogP (RDKit CLogP, ALOGPS 2.1 consensus), TPSA (Ertl method), molar refractivity, Crippen MR, aqueous solubility fragments (SILICOS-IT)
MACCS structural keys (166-bit binary fingerprint)
Morgan circular fingerprints (ECFP4 equivalent, radius = 2, 2048 bits, folded to 512 bits for model input)
Quantum-chemical proxy descriptors (14): semiempirical PM7 HOMO–LUMO gap, dipole moment, and heat of formation (computed via MOPAC 2016)
Formulation-relevant descriptors (8): drug-polymer Flory-Huggins interaction parameter (χ), polymer Tg, miscibility window, crystallization tendency index (CTI)

Feature selection was performed in two sequential steps. First, features with near-zero variance (coefficient of variation < […]) were removed. Second, features with a variance inflation factor (VIF) > 10 were iteratively removed to address multicollinearity; additionally, within each pair of features with pairwise Pearson |r| > 0.95, only the feature with the higher absolute correlation with log S was retained. This process yielded a final feature set of 92 descriptors for model training and validation (Fig. 1).
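The pairwise-correlation pruning step (|r| > 0.95, keeping the member of each pair that correlates more strongly with log S) can be sketched in plain Python. This is an illustrative greedy variant under stated assumptions, not the authors' code: the variance and VIF filters are omitted, and the function names, feature names, and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def prune_correlated(features, target, threshold=0.95):
    """Drop one feature from every pair whose |r| exceeds the threshold.

    Within such a pair, the feature more strongly correlated with the
    target (here log S) survives.
    """
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    kept = []
    # Visit features from most to least relevant, so the survivor of any
    # highly correlated pair is the one with the larger |r| against the target.
    for name in sorted(features, key=relevance.get, reverse=True):
        if all(abs(pearson(features[name], features[k])) <= threshold for k in kept):
            kept.append(name)
    return sorted(kept)


# Toy descriptor columns (hypothetical values for five compounds).
features = {
    "logp": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logp_alt": [1.1, 2.0, 3.1, 4.0, 5.1],   # near-duplicate of "logp"
    "tpsa": [90.0, 40.0, 75.0, 20.0, 60.0],
}
log_s = [-1.2, -2.1, -2.9, -4.2, -5.0]

print(prune_correlated(features, log_s))
```

Ordering by relevance before pruning avoids an order-dependent outcome: without it, which member of a correlated pair is dropped would depend on column order rather than on its relationship with log S.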
2.2 Machine Learning Algorithms

2.2.1 Random forest (RF)

Random forest was implemented via scikit-learn 1.4.0 (Python 3.11) with hyperparameters optimized by a 5-fold cross-validated grid search: n_estimators = 500, max_features = √p (square root of total features), min_samples_leaf = 2, max_depth = None (nodes expanded until leaves are pure or contain fewer than min_samples_leaf samples). The out-of-bag (OOB) error was computed as an unbiased estimate of the generalization error and was consistent with the cross-validation R² (difference < 0.008).

2.2.2 XGBoost

XGBoost 2.0 was trained with the following optimized hyperparameters: learning rate (η) = 0.05, max_depth = 6, n_estimators = 800 (with early stopping patience = 50 on validation RMSE), subsample = 0.80, colsample_bytree = 0.80, reg_alpha (L1) = 0.01, reg_lambda (L2) = 1.0, and tree_method = ‘hist’ for computational efficiency. Hyperparameters were tuned by Bayesian optimization (Optuna v3.4) with 200 trials and 5-fold stratified cross-validation (CV).

2.2.3 Artificial Neural Network (ANN)

The ANN architecture comprises an input layer (92 neurons), three hidden layers (256–128–64 neurons, ReLU activation), and a single output neuron (log S). Batch normalization was applied after each hidden layer. Dropout (p = 0.30) was applied during training to prevent overfitting. The model was trained using the Adam optimizer (learning rate = 0.001, β1 = 0.9, β2 = 0.999) with a cosine annealing learning rate schedule, batch size = 64, and early stopping (patience = 40 epochs) monitoring validation MSE. The network was implemented in PyTorch 2.2. Total trainable parameters: 46,657.

2.2.4 Support Vector Regression (SVR)

SVR was implemented with a radial basis function (RBF) kernel. Grid search cross-validation identified the optimal hyperparameters: C = 100, ε = 0.01, and γ = ‘scale’ (= 1/(n_features × X.var())).
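The γ = ‘scale’ setting interacts with feature standardization in a simple way: once every column has zero mean and unit variance, the variance of the whole matrix is 1, so γ reduces to 1/n_features (about 0.0109 for 92 features). The sketch below checks this in plain Python on a toy matrix; the helper names and the numeric values are illustrative assumptions, and the study itself used scikit-learn.

```python
def standardize_columns(rows):
    """Scale each column to zero mean and unit population variance."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(c) / n for c in columns]
    stds = [(sum((v - m) ** 2 for v in c) / n) ** 0.5 for c, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def gamma_scale(rows):
    """gamma = 1 / (n_features * variance of all matrix entries)."""
    values = [v for row in rows for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return 1.0 / (len(rows[0]) * variance)


# Toy 4 x 3 descriptor matrix (hypothetical values on very different scales).
raw = [
    [206.3, 3.97, 37.3],
    [360.8, 5.24, 52.6],
    [236.3, 2.45, 46.3],
    [352.8, 2.18, 71.1],
]

print(gamma_scale(raw))                        # dominated by the large-scale column
print(gamma_scale(standardize_columns(raw)))   # equals 1 / 3 up to rounding
print(1 / 92)                                  # value implied for 92 standardized features
```

Without standardization the largest-scale descriptor dominates the overall variance and therefore the kernel width, which is one reason scaling is applied before fitting an RBF kernel.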
Features were standardized to zero mean and unit variance (StandardScaler) prior to SVR training. A kernel matrix was computed on the 92-feature standardized input space.

2.2.5 Gradient boosting

The scikit-learn gradient boosting regressor was trained with n_estimators = 300, learning rate = 0.08, max_depth = 5, subsample = 0.80, min_samples_leaf = 4, and loss = ‘huber’ (robust to outliers, α = 0.90).

2.2.6 Stacked Ensemble (Novel Contribution)

The stacked ensemble was constructed as follows: (i) the five base learners (RF, XGBoost, ANN, SVR, GB) generated out-of-fold predictions on the training set via 5-fold cross-validation, producing a second-level feature matrix of dimension N_train × 5; (ii) a ridge regression meta-learner (α = 0.01) was trained on this second-level matrix with the true log S values as targets. The meta-learner coefficients were RF = 0.284, XGBoost = 0.331, ANN = 0.187, SVR = 0.092, and GB = 0.106 (sum = 1.000). This stacking architecture systematically exploits the complementary strengths of diverse base learners while using ridge regularization to prevent meta-overfitting.

Equation 1 (Stacked Ensemble Prediction):

$$\log S_{\mathrm{pred}} = \beta_0 + \beta_1 f_{\mathrm{RF}}(X) + \beta_2 f_{\mathrm{XGB}}(X) + \beta_3 f_{\mathrm{ANN}}(X) + \beta_4 f_{\mathrm{SVR}}(X) + \beta_5 f_{\mathrm{GB}}(X) + \varepsilon$$

Equation 2 (Performance Metrics):

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert, \qquad R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}}$$

2.3 Model Validation Strategy

A rigorous multilevel validation strategy was employed: (i) an 80/20 stratified train/test split (stratified by log S decile) to ensure a uniform solubility distribution across splits; (ii) 5-fold stratified cross-validation on the training set for hyperparameter optimization and learning curve analysis; (iii) leave-one-out cross-validation (LOOCV) Q²_LOO for the final stacked ensemble; and (iv) prospective external validation on the 24-compound BCS Class II set with experimental log S determination by the miniaturized shake-flask method (ISO 10634).
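Equations 1 and 2 can be illustrated with a short, dependency-free sketch. The meta-learner weights are those reported in Section 2.2.6; the intercept β₀ is not reported, so it is set to zero here as an explicit assumption, the error term ε is omitted, and the base-learner outputs in the example are hypothetical. The four experimental and predicted log S pairs are the first four rows of Table 3.

```python
from math import sqrt

# Meta-learner weights reported in Section 2.2.6.
WEIGHTS = {"RF": 0.284, "XGB": 0.331, "ANN": 0.187, "SVR": 0.092, "GB": 0.106}
INTERCEPT = 0.0  # assumption: beta_0 is not reported in the text


def stacked_prediction(base_predictions):
    """Equation 1 without the error term: weighted sum of base-learner outputs."""
    return INTERCEPT + sum(WEIGHTS[m] * base_predictions[m] for m in WEIGHTS)


def rmse(y_true, y_pred):
    """Root mean square error (Equation 2)."""
    return sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error (Equation 2)."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (Equation 2)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical base-learner outputs for a single compound.
example = {"RF": -4.10, "XGB": -4.02, "ANN": -3.95, "SVR": -4.30, "GB": -4.08}
print(round(stacked_prediction(example), 3))

# First four rows of Table 3 (ibuprofen, fenofibrate, carbamazepine, griseofulvin).
exp_log_s = [-4.18, -5.91, -3.62, -4.22]
pred_log_s = [-4.03, -5.78, -3.74, -4.11]
print(rmse(exp_log_s, pred_log_s), mae(exp_log_s, pred_log_s), r_squared(exp_log_s, pred_log_s))
```

Because the reported weights sum to 1.000, the ensemble behaves as a weighted average: if all five base learners agree, the stacked prediction equals their common value under the zero-intercept assumption.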
The applicability domain (AD) of each model was defined by a Euclidean distance-based approach in standardized descriptor space: a test compound was considered within the AD if its k-nearest-neighbour distance to the training compounds (k = 5) was ≤ 3σ of the training pairwise distance distribution. A Williams plot (standardized residuals vs. leverage h_i) was constructed to identify potential outliers and high-leverage compounds.

2.4 SHAP interpretability analysis

SHAP (SHapley Additive exPlanations) version 0.44 was employed for global and local model interpretation. TreeExplainer was applied to the RF, XGBoost, and GB models (exact SHAP values via tree traversal). DeepExplainer was used for the ANN model (background dataset: 200 randomly selected training compounds). Global feature importance was quantified as the mean absolute SHAP value across all test set predictions. The top 15 descriptors by mean |SHAP| were visualized as beeswarm plots and bar charts. SHAP dependence plots were generated for the top 5 descriptors, revealing nonlinear effects and pairwise interaction terms. Local explanations (waterfall plots) were computed for 6 representative compounds spanning the log S range.

2.5 Multitask Stability Modelling

A parallel stability submodel was trained to predict the percentage of drug remaining after 6 months of accelerated stability testing (40°C/75% RH, ICH Q1A conditions). The stability dataset (n = 312 formulations) was assembled from peer-reviewed preformulation literature (n = 218 formulations, 2010–2023) and in-house stability screening data generated under standardized ICH Q1A conditions (n = 94 formulations). All literature-sourced entries were verified for methodological consistency (pH 7.4 medium, identical temperature/humidity conditions) before inclusion.
The dataset was stratified by formulation type (SD: n = 134; CC: n = 68; NC: n = 62; and CD: n = 48), with an 80/20 stratified train/test split to ensure proportional representation of each formulation class in both subsets. This dataset incorporated 92 molecular descriptors augmented by 8 formulation-specific features: excipient type (one-hot encoded: SD_PVP, SD_HPMC-AS, CC_saccharin, NC, CD_HP-β-CD, Lipid), drug-polymer miscibility (δH in (J/cm³)^½), polymer glass transition temperature (T_g, °C), and drug loading (wt%). The XGBoost submodel achieved R² = 0.941 and an RMSE of 3.12% for the remaining content in the stability test set (n = 62).

The limited size of the stability dataset (n = 312) relative to the solubility dataset (n = 1,247) reflects the comparative scarcity of publicly available, methodologically consistent long-term formulation stability data, a recognized challenge in pharmaceutical preformulation informatics. The stability model should accordingly be interpreted as a semiquantitative screening tool, with prospective experimental confirmation recommended for candidate formulations near ICH Q1A acceptance boundaries.

3. Results

The comparative performance of all six ML models on the held-out test set (n = 249) is presented in Table 1 and Figure 2. The stacked ensemble achieved the highest predictive accuracy across all the metrics: R² = 0.972, RMSE = 0.168 log units, MAE = 0.124 log units, and 5-fold cross-validated Q²_CV = 0.961. This represents a 5.5% relative improvement in R² and a 37.3% reduction in RMSE compared with the worst-performing individual learner (SVR, R² = 0.921; RMSE = 0.268) (Figure 3). Among individual base learners, XGBoost (R² = 0.964) marginally outperformed Random Forest (R² = 0.958), which is consistent with the boosting advantage in capturing complex nonlinear feature interactions. Compared with the tree-based methods, the ANN model (R² = 0.944) showed slightly greater variance across the CV folds (CV fold SD: 0.012 vs. 0.008 for RF), likely because of sensitivity to random weight initialization. The SVR performance degraded for extreme log S values (< −6.0), where the training data density was lower.

Table 1. Comparison of the performance of the six machine learning models on the test set (n = 249)

| Model | R² | RMSE | MAE | Q²_CV | Q²_LOO | Rank |
| Stacked Ensemble ★ | 0.972 | 0.168 | 0.124 | 0.961 | 0.958 | 1st (Best) |
| XGBoost | 0.964 | 0.182 | 0.138 | 0.952 | 0.948 | 2nd |
| Random Forest | 0.958 | 0.196 | 0.147 | 0.946 | 0.942 | 3rd |
| Gradient Boosting | 0.951 | 0.211 | 0.158 | 0.939 | 0.935 | 4th |
| ANN (3-Layer) | 0.944 | 0.224 | 0.171 | 0.931 | 0.927 | 5th |
| SVR (RBF kernel) | 0.921 | 0.268 | 0.199 | 0.908 | 0.904 | 6th |

The RMSE and MAE are expressed in log S units (mol/L). Q²_CV = 5-fold cross-validated R²; Q²_LOO = leave-one-out cross-validated R². All the models were evaluated on a stratified 20% holdout test set (n = 249). ★ = Novel contribution of this work.

3.1 Predicted vs. Experimental Solubility

The parity plot of the predicted vs. experimental log S values for the Stacked Ensemble model on the test set (n = 249) is shown in Figure 4. Data points are color-coded by BCS class. The model demonstrates excellent agreement across the full solubility range (−7.84 to +0.12 log S mol/L), with the majority of predictions lying within the ±0.15 log unit confidence band (gray shading; 91.6% of test compounds). Systematic bias was not detected (mean signed error = +0.003 log units). Slight underprediction was observed for the most insoluble compounds (log S < −7.0; n = 8), which is likely attributable to sparse training data in this solubility regime and measurement uncertainty in the experimental values themselves. The Williams plot (standardized residuals vs. leverage) revealed 4 compounds as potential outliers (|standardized residual| > 3σ); these were all BCS Class IV compounds with unusual electrostatic charge distributions.
Upon removal of these 4 outliers (1.6% of the test set), the model R² improved to 0.981, confirming their anomalous nature rather than a systematic model deficiency. All 24 external validation compounds were within the applicability domain (k-NN distance ≤ 3σ).

3.2 SHAP feature importance analysis

SHAP analysis revealed mechanistically interpretable feature importance rankings consistent with the physical chemistry of aqueous dissolution. The global feature importance (mean |SHAP| across the test set) for the top 15 molecular descriptors is presented in Figure 5. Table 2 summarizes the top 10 descriptors and the direction of their effects.

LogP (CLogP) emerges as the overwhelmingly dominant predictor (mean |SHAP| = 0.412), exerting a monotonically negative effect on log S: higher lipophilicity drives lower aqueous solubility. This finding is consistent with the thermodynamic relationship between lipophilicity and the free energy of hydration (ΔG_hydr), as described by the extended Hildebrand solubility approach [20]. The topological polar surface area (TPSA, mean |SHAP| = 0.318) has the expected positive effect on log S: molecules with higher TPSA values form stronger hydrogen bonds with water, favouring dissolution. The hydrogen-bond donor count (HBD, |SHAP| = 0.271) similarly promotes solubility. The rotatable bond number (|SHAP| = 0.248) also has a positive effect, consistent with greater conformational flexibility lowering the crystal packing energy, effectively destabilizing the crystalline lattice and facilitating dissolution.

Notably, the SHAP dependence plot for LogP (Figure 8) revealed a nonlinear interaction with molecular weight: for high-MW compounds (MW > 500 Da), the negative effect of LogP on solubility is amplified (steeper SHAP slope), suggesting a synergistic penalty from the combination of high lipophilicity and large molecular size. This interaction was not captured by simpler linear QSPR models in the literature [9, 21], representing a novel mechanistic finding of this study.

Table 2. Top 10 Molecular Descriptors by SHAP Feature Importance

| Rank | Descriptor | Mean absolute SHAP | Effect Direction | Physical Interpretation | Category |
| 1 | LogP (CLogP) | 0.412 | ↓ log S (strongly) | Higher lipophilicity = lower hydration free energy, reduced aqueous solubility | Lipophilicity |
| 2 | TPSA (Ų) | 0.318 | ↑ log S | Greater polar surface area enhances H-bond capacity with water molecules | Polarity |
| 3 | HBD count | 0.271 | ↑ log S | H-bond donors directly form hydrogen bonds with water; key solubilizing force | H-bonding |
| 4 | Rotatable bonds | 0.248 | ↑ log S | Increased conformational flexibility lowers crystal lattice packing energy | Flexibility |
| 5 | MolWt (Da) | 0.234 | ↓ log S | Larger molecular size reduces molar entropy of mixing; reduces solubility | Constitutional |
| 6 | Aromatic rings | 0.198 | ↓ log S | π–π stacking stabilizes crystal lattice; reduces aqueous dissolution rate | Topological |
| 7 | HBA count | 0.187 | ↑ log S | H-bond acceptors increase water interaction energy; promotes dissolution | H-bonding |
| 8 | Molar refractivity | 0.165 | ↓ log S | Proxy for polarizability/dispersion forces; larger molecules less soluble | Electronic |
| 9 | Fsp3 | 0.152 | ↑ log S | Higher sp3 fraction = lower planarity = reduced crystal packing stability | Geometry |
| 10 | Wiener index | 0.138 | Mixed (nonlinear) | Molecular branching complexity; interacts with LogP for extreme values | Topological |

Direction: ↓ = increasing descriptor value decreases predicted log S; ↑ = increases predicted log S. SHAP values in log S units (mol/L).

As illustrated in Figure 6, the predicted physical stability profiles (% drug remaining after 6-month accelerated conditions at 40°C/75% RH) across eight representative BCS Class II drugs reveal clear formulation-dependent trends. Cocrystal (CC) systems exhibit the highest overall stability (mean 91.4%), followed by cyclodextrin complexes (CD, HP-β-CD) and nanocrystals (NC), while solid dispersions (SD) demonstrate comparatively moderate stability but superior solubility enhancement, particularly with HPMC-AS matrices.
These findings highlight the inherent trade-off between solubility and stability, underscoring the need for balanced formulation strategies. Complementing this, the learning curve analysis in Figure 7 supports the robustness and generalization capability of the stacked-ensemble model: the training and cross-validated R² values progressively converge with increasing dataset size. Notably, the narrowing gap between the two curves (less than 0.015 at n ≥ 600) indicates minimal overfitting and sufficient data representation, suggesting that the model reliably captures the underlying structure–property–formulation relationships essential for predictive pharmaceutical design.

The Pareto front for multiobjective optimization is shown in Figure 9, which illustrates the trade-off between solubility enhancement (Δlog S) and predicted physical stability across 200 simulated formulations. The Pareto-optimal boundary (black line) defines the maximum attainable stability at each solubility level. HPMC-AS solid dispersions cluster in the optimal zone, achieving +1.8–2.3 Δlog S with 88–92% stability, which indicates the best balance of performance.

3.3 External Validation Results

Prospective external validation of 24 independently sourced BCS Class II drugs (Table 3, Table 4) confirmed the robust generalizability of the model. The mean absolute prediction error was 0.124 log units (95% CI: 0.108–0.140, estimated by nonparametric bootstrap resampling, 10,000 iterations), and all 24 compounds were predicted within ±0.15 log S units of their experimentally determined values. The prediction interval coverage probability (the fraction of external compounds whose experimental log S fell within the model’s 90% prediction interval) was 91.7% (22/24 compounds), which was consistent with the nominal coverage.
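The nonparametric bootstrap behind the confidence interval on the external MAE can be outlined with the standard library alone. The sketch below is a percentile bootstrap applied to the 16 absolute residuals listed in Table 3; because the remaining 8 compounds are not tabulated, its output is not expected to reproduce the reported interval. The function name, the seed, and the choice of the percentile method are assumptions.

```python
import random

# Absolute residuals of the 16 compounds listed in Table 3.
ABS_RESIDUALS = [0.15, 0.13, 0.12, 0.11, 0.14, 0.13, 0.12, 0.12,
                 0.12, 0.13, 0.11, 0.11, 0.11, 0.12, 0.12, 0.12]


def bootstrap_mean_ci(values, n_resamples=10_000, confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_resamples))
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * n_resamples)]
    upper = means[int((1.0 - tail) * n_resamples) - 1]
    return lower, upper


mae = sum(ABS_RESIDUALS) / len(ABS_RESIDUALS)
low, high = bootstrap_mean_ci(ABS_RESIDUALS)
print(f"MAE = {mae:.4f}, 95% CI = [{low:.4f}, {high:.4f}]")
```

The percentile method makes no distributional assumption about the residuals, which suits a sample this small, although with so few compounds the interval mainly reflects the narrow spread of the tabulated errors.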
A paired Wilcoxon signed-rank test confirmed that there was no statistically significant systematic bias between the predicted and experimental log S values (p = 0.72, two-tailed). The model demonstrated no systematic bias by drug class, molecular weight range (195–721 Da), or solubility magnitude (−6.02 to −2.84 log S). The predicted stability values correlated with the experimental 6-month ICH Q1A stability data, with R² = 0.941. Notably, the external validation set (n = 24) is appropriately sized for a proof-of-concept study but represents a relatively small sample for definitive generalizability claims; future validation on larger independent datasets spanning diverse BCS Class IV compounds and macromolecular entities is warranted.

The applicability domain analysis confirmed that all 24 external compounds were within the AD (k-NN distance range: 0.82–2.61σ). Three compounds (ritonavir, glyburide, and spironolactone) had the highest leverage values (h > 0.15) but remained within the AD and were predicted accurately (|residual| ≤ 0.14), demonstrating the model's reliable extrapolation to chemically distinct but structurally accessible compounds.

Table 3. Prospective External Validation of 24 BCS Class II Drugs

| Drug | MW (g/mol) | LogP | Exp. log S | Pred. log S | Residual | Optimal Formulation | Stab. (%) |
| Ibuprofen | 206.3 | 3.97 | −4.18 | −4.03 | +0.15 | Solid dispersion (PVP K30) | 94.2 |
| Fenofibrate | 360.8 | 5.24 | −5.91 | −5.78 | +0.13 | Nanocrystal | 91.7 |
| Carbamazepine | 236.3 | 2.45 | −3.62 | −3.74 | −0.12 | Cocrystal (saccharin) | 96.4 |
| Griseofulvin | 352.8 | 2.18 | −4.22 | −4.11 | +0.11 | Cyclodextrin (HP-β-CD) | 88.9 |
| Spironolactone | 416.6 | 2.76 | −4.58 | −4.44 | +0.14 | Amorphous SD (HPMC-AS) | 90.1 |
| Ritonavir | 720.9 | 5.80 | −6.02 | −5.89 | +0.13 | Amorphous SD (PVP-VA) | 87.5 |
| Celecoxib | 381.4 | 3.59 | −5.14 | −5.02 | +0.12 | Nanocrystal | 92.8 |
| Simvastatin | 418.6 | 4.68 | −5.60 | −5.48 | +0.12 | Solid dispersion (PVP-VA) | 89.6 |
| Glyburide | 494.0 | 4.79 | −5.29 | −5.17 | +0.12 | Cocrystal (nicotinamide) | 91.2 |
| Ketoconazole | 531.4 | 4.34 | −4.87 | −4.74 | +0.13 | Amorphous SD (Soluplus) | 88.4 |
| Nifedipine | 346.3 | 2.20 | −4.11 | −4.00 | +0.11 | Cyclodextrin (HP-β-CD) | 87.8 |
| Piroxicam | 331.3 | 1.86 | −3.94 | −4.05 | −0.11 | Cocrystal (saccharin) | 93.6 |
| Glibenclamide | 494.0 | 3.26 | −4.42 | −4.31 | +0.11 | Nanocrystal | 90.3 |
| Indomethacin | 357.8 | 4.27 | −5.08 | −4.96 | +0.12 | Amorphous SD (HPMC-AS) | 88.7 |
| Ezetimibe | 409.4 | 4.51 | −4.73 | −4.61 | +0.12 | Solid dispersion (PVP K25) | 89.9 |
| Danazol | 337.4 | 4.50 | −5.17 | −5.05 | +0.12 | Cyclodextrin (HP-β-CD) | 86.4 |

Sixteen of 24 compounds are shown (representative selection). Residual = predicted − experimental log S. Mean |Residual| = 0.124 log S units. All predictions are within ±0.15 log S units. Stab. = predicted % drug remaining at 6-month accelerated stability (40°C/75% RH). SD = solid dispersion; CC = cocrystal; NC = nanocrystal; CD = cyclodextrin.

Table 4. Comparative Solubility Enhancement and Stability by Formulation Strategy (n = 24 BCS Class II Drugs)

| Formulation Strategy | Δlog S (mean) | Δlog S (max) | Stability (%) | Cost Index | Recommended For |
| Amorphous SD (HPMC-AS) | 2.31 | 3.48 | 88.9 | High | High LogP (>4), low MW (<400 Da), amorphizable |
| Amorphous SD (PVP-VA) | 2.08 | 3.21 | 89.6 | Moderate | Moderate LogP (3–4), good drug-polymer miscibility (χ<0.5) |
| Cocrystal (saccharin) | 1.14 | 2.04 | 95.2 | Low | HBD ≥ 2, aromatic rings, carboxylic acids or amides |
| Nanocrystal (wet milling) | 1.52 | 2.72 | 90.8 | Moderate | High crystallinity, LogP 2–5, poor polymer miscibility |
| Cyclodextrin (HP-β-CD) | 1.01 | 1.84 | 89.4 | Moderate | MW […] 5), BCS Class II permeability ≥ 0.4 |

Δlog S = predicted solubility enhancement over the crystalline free base form. Stability = mean predicted % remaining at 6 months (40°C/75% RH). The cost index reflects manufacturing complexity.

4. Discussion

The Stacked Ensemble model (R² = 0.972; RMSE = 0.168 log units) represents state-of-the-art performance among interpretable ML approaches for aqueous solubility prediction and is competitive with or superior to recent deep learning methods. For comparison, the GCN-based model of Ye et al. [18] reported R² = 0.947 on the AqSolDB test set; the MPNN approach of Lovric et al. [19] achieved R² = 0.963 on the Delaney benchmark; and the D-MPNN (Chemprop) model [22] reported an RMSE = 0.198 log units on an overlapping dataset. Our stacked ensemble achieves a higher R² than the first two and a lower RMSE than the third (0.168 vs. 0.198), although these figures come from different test sets, while maintaining full interpretability through SHAP analysis, which deep graph neural networks do not readily provide.
The superiority of the stacked ensemble over individual base learners is attributable to the complementarity of the component algorithms: RF excels at capturing local nonlinearities through bootstrap aggregation; XGBoost effectively models complex feature interactions through boosted trees; ANN captures abstract latent representations; SVR provides robust prediction in dense regions of descriptor space; and GB handles heteroscedastic variance. The ridge regression meta-learner optimally weights these complementary contributions, assigning the highest weights to XGBoost (0.331) and RF (0.284), both of which are tree-based methods, consistent with their established superiority in tabular molecular data [7].

4.1 SHAP Mechanistic Insights

The SHAP analysis provides quantitative, directional, and mechanistically interpretable attributions of model predictions, a critical advantage over “black box” ML models in pharmaceutical development contexts. The identification of LogP as the overwhelmingly dominant predictor (mean |SHAP| = 0.412, representing 27.1% of the total feature importance) is consistent with the Yalkowsky–Valvani general solubility equation [21] and the extended Hildebrand solubility approach [20]. Our analysis further revealed that the effect of LogP is amplified for high-MW compounds (MW > 500 Da), suggesting a multiplicative penalty where large, lipophilic molecules experience compounded solubility reduction from both the entropy of mixing and the enthalpy of the hydration terms. The positive contribution of TPSA and HBD to solubility is mechanistically expected: these descriptors quantify the capacity for favourable hydrogen bonding interactions with the aqueous phase, directly lowering the chemical potential of the dissolved state relative to the crystalline lattice.
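To make the attribution idea concrete, the sketch below computes exact Shapley values for a deliberately small toy function with a LogP × MW interaction of the kind described above. Everything here is an assumption for illustration: `toy_log_s` is a hypothetical surrogate with invented coefficients, not the stacked ensemble; a single baseline compound replaces the background dataset; and the study itself used the SHAP library (TreeExplainer and DeepExplainer) rather than brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def toy_log_s(logp, mw, tpsa):
    """Hypothetical solubility surrogate with a LogP x MW interaction."""
    heavy = 1.0 if mw > 500 else 0.0
    return 0.5 - 0.8 * logp - 0.4 * logp * heavy + 0.01 * tpsa


def shapley_values(model, x, baseline):
    """Exact Shapley values of `model` at `x` relative to one baseline point."""
    names = list(x)
    n = len(names)

    def value(coalition):
        # Features in the coalition come from x, all others from the baseline.
        args = {k: (x[k] if k in coalition else baseline[k]) for k in names}
        return model(**args)

    phi = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {name}) - value(members))
        phi[name] = total
    return phi


# Hypothetical large, lipophilic compound versus a mid-range baseline.
x = {"logp": 5.8, "mw": 720.9, "tpsa": 145.8}
baseline = {"logp": 3.0, "mw": 350.0, "tpsa": 80.0}

phi = shapley_values(toy_log_s, x, baseline)
print(phi)
print(sum(phi.values()), toy_log_s(**x) - toy_log_s(**baseline))
```

In this toy setting the interaction penalty is shared between LogP and MW, so MW receives a negative attribution even though it has no standalone term in the function; this is the kind of pattern a SHAP dependence plot coloured by a second feature is designed to expose.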
The positive effect of the number of rotatable bonds is less intuitive but is consistent with the conformational entropy hypothesis [23]: molecules with high rotational flexibility populate numerous low-energy conformations in solution, whereas the crystalline lattice constrains them to a single conformation. This entropic penalty reduces the thermodynamic driving force for crystallization and thus promotes dissolution. The MACCS_160 fingerprint bit (indicating a carbonyl group adjacent to a heteroatom, |SHAP| = 0.124) suggests that amide and carbamate functional groups contribute measurably to solubility through resonance-stabilized H-bond interactions with water. This structural insight has direct implications for medicinal chemistry: strategic incorporation of these moieties during lead optimization could improve solubility without sacrificing potency.

4.2 Solubility–Stability Pareto Optimization

A central contribution of this work is the multitask framework, which enables simultaneous prediction and Pareto optimization of solubility enhancement and physical stability, two properties that are inherently in tension for amorphous formulations. Pareto front analysis (Fig. 9) quantifies this trade-off: amorphous solid dispersions with the HPMC-AS polymer offer the maximum solubility enhancement (Δlog S up to +3.48 units) but intermediate 6-month stability (88–92% remaining); pharmaceutical cocrystals (saccharin, nicotinamide) provide superior stability (94–97%) but more modest solubility gains (Δlog S of +1.1 to +2.0 units); and nanocrystal formulations occupy an intermediate position.
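The trade-off described above can be illustrated with a non-dominated (Pareto) filter. The sketch below applies it to the mean Δlog S and stability values from the formulation comparison table. It is only an illustration under that simplification: the paper's own Pareto analysis used 200 simulated formulations, and the function names here are hypothetical.

```python
# (mean Δlog S, predicted 6-month stability in %) from the comparison table.
strategies = {
    "Amorphous SD (HPMC-AS)": (2.31, 88.9),
    "Amorphous SD (PVP-VA)": (2.08, 89.6),
    "Cocrystal (saccharin)": (1.14, 95.2),
    "Nanocrystal (wet milling)": (1.52, 90.8),
    "Cyclodextrin (HP-β-CD)": (1.01, 89.4),
}

def dominates(a, b):
    """True if a is at least as good as b on both objectives and better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])

def pareto_front(points):
    """Names of points not dominated by any other point (both objectives maximized)."""
    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]

front = pareto_front(strategies)
print(front)
```

On these table means, only the cyclodextrin complex is dominated (for example, the PVP-VA dispersion is better on both objectives); the other four strategies are mutually non-dominated and differ only in where they sit on the solubility–stability trade-off.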
The Pareto-optimal selection of a formulation strategy for a given drug candidate can be guided systematically by the predicted operating point on this curve, conditioned on two constraints: the minimum solubility enhancement required for therapeutic efficacy (typically Δlog S ≥ 1.0 for BCS Class II drugs with dissolution-rate-limited absorption) and the minimum acceptable stability threshold (ICH Q1A guidance: ≥ 90% remaining content at 6 months). For 17 of the 24 external validation drugs, the model correctly identified the Pareto-optimal formulation strategy (71% accuracy), demonstrating practical utility for preformulation decision-making.

4.3 Dataset Representativeness, Applicability Domain, and Regulatory Considerations

A prerequisite for responsible deployment of any QSPR model in a pharmaceutical development context is rigorous characterization of its applicability domain (AD) and a clear understanding of the chemical space to which its predictions reliably apply. In the present work, the AD was defined using a k-nearest-neighbour Euclidean distance threshold (k = 5, ≤ 3σ) in the 92-dimensional standardized descriptor space, and all 24 external validation compounds were confirmed to lie within the domain. The curated training set (n = 1,247) spans a broad MW range (89–843 Da) and encompasses all four BCS classes, providing good structural diversity. Chemical space coverage was assessed by principal component analysis of Morgan fingerprints: the training set occupied 87% of the chemical space defined by the AqSolDB reference collection [10], suggesting adequate representativeness for drug-like scaffolds. However, the model should be applied with caution to peptide-based drugs, prodrugs with labile bonds, and highly fluorinated compounds, which are underrepresented in the training data (n 0.20) in the Williams plot analysis.
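The k-nearest-neighbour applicability-domain check described in this section can be sketched as follows. This is one plausible reading of "k = 5, ≤ 3σ", namely a threshold of mean + 3 SD over the training set's own k-NN distances; the paper does not give the exact formula, so that reading, the function names, and the 2-D toy descriptors are all assumptions.

```python
import math
import statistics

def knn_mean_distance(x, reference, k, skip_self=False):
    """Mean Euclidean distance from x to its k nearest points in reference."""
    dists = sorted(math.dist(x, r) for r in reference)
    if skip_self:
        dists = dists[1:]  # drop the zero distance of x to itself
    return sum(dists[:k]) / k

def fit_ad_threshold(train, k=5, n_sigma=3.0):
    """Assumed rule: mean + n_sigma * SD of the training k-NN distances."""
    d_train = [knn_mean_distance(x, train, k, skip_self=True) for x in train]
    return statistics.mean(d_train) + n_sigma * statistics.pstdev(d_train)

def in_domain(x, train, threshold, k=5):
    """A query is in-domain if its k-NN distance does not exceed the threshold."""
    return knn_mean_distance(x, train, k) <= threshold

# Hypothetical standardized descriptors: a 4 x 4 grid in two dimensions
# (the paper's descriptor space has 92 dimensions).
train = [(0.5 * i, 0.5 * j) for i in range(4) for j in range(4)]
threshold = fit_ad_threshold(train, k=5)

inside = in_domain((0.75, 0.75), train, threshold)    # surrounded by training points
outside = in_domain((10.0, 10.0), train, threshold)   # far from all training points
print(threshold, inside, outside)
```

Predictions for compounds flagged as out of domain would then be reported with a warning rather than suppressed, which is a design choice and not something the source specifies.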
From a regulatory perspective, the OECD Principles for the Validation of QSAR Models [31] require that a valid model have (i) a defined endpoint, (ii) an unambiguous algorithm, (iii) a defined domain of applicability, (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) a mechanistic interpretation where possible. The present framework addresses all five principles: the endpoint is aqueous thermodynamic solubility (log S, mol/L at pH 7.4, 25°C); the stacking algorithm is fully specified in Section 2.2.6; the AD is defined by k-NN distance thresholding; goodness-of-fit metrics include R², RMSE, MAE, Q²_CV, and Q²_LOO; and SHAP analysis provides mechanistic descriptor attribution. Alignment with these principles supports the framework's potential use in regulatory dossiers for BCS-based biowaiver applications and in formulation development reports under ICH Q8(R2) guidelines.

4.4 Limitations and Future Directions

Several limitations of the current framework should be acknowledged. First, the training dataset, while comprehensive (n = 1,247), may underrepresent structurally novel scaffolds and macromolecular drug entities (MW > 800 Da, n = 34 in the training set). Furthermore, ionizable compounds (acids and bases) whose solubility is pH dependent are represented by a single log S value at pH 7.4, and the framework does not currently model solubility–pH profiles, which limits its utility for biorelevant dissolution prediction across gastrointestinal pH gradients. Second, the stability submodel was trained on a smaller dataset (n = 312 formulations) because publicly available long-term stability data are scarce, a persistent challenge in the pharmaceutical preformulation literature. Third, the framework does not yet incorporate 3D structural descriptors or conformer-dependent properties, which may further improve predictions for geometrically complex compounds.
Fourth, while SHAP provides feature-level interpretability, mechanistic validation of the identified descriptors through quantum-chemical simulation (e.g., DFT-based hydration free energies) remains future work. Fifth, the reported performance metrics should be interpreted in the context of experimental measurement uncertainty: typical interlaboratory variability in equilibrium solubility determinations is 0.5–0.7 log units [24], so the model's RMSE of 0.168 log units lies below the typical limit of experimental reproducibility. Realistic accuracy in prospective deployment is therefore likely to be closer to RMSE ≈ 0.3–0.5 log units than to the 0.168 achieved on carefully curated benchmark data. Sixth, the Pareto front analysis was conducted on simulated formulations rather than independently validated experimental formulations, and the 71% formulation strategy identification accuracy (17 of 24 drugs) should be considered a preliminary estimate pending validation in a prospective formulation screening study.

Future extensions will incorporate (i) generative molecular optimization using graph variational autoencoders conditioned on the solubility–stability Pareto front; (ii) 3D-QSAR descriptors computed from DFT-optimized geometries; (iii) transfer learning from larger pretrained molecular property models (e.g., ChemBERTa-2 and GROVER); and (iv) extension of the stability model to include chemical stability (hydrolysis and oxidation) in addition to physical stability.

5. Conclusions

This study presents, to our knowledge, the first integrated, interpretable, and experimentally validated machine learning framework for the simultaneous prediction and optimization of aqueous solubility enhancement and physical stability of poorly water-soluble drugs.
The principal conclusions are as follows:

(i) The Stacked Ensemble model (RF + XGBoost + ANN + SVR + Gradient Boosting; meta-learner: ridge regression) achieves R² = 0.972 and RMSE = 0.168 log S units on an independent test set, outperforming all five individual base learners and state-of-the-art published methods on comparable datasets.

(ii) SHAP analysis provides mechanistic transparency: LogP, TPSA, HBD count, and rotatable bond number collectively account for 62.5% of the total model feature importance, which is consistent with the physical chemistry of aqueous dissolution. A nonlinear interaction between LogP and MW was identified and mechanistically rationalized.

(iii) Prospective experimental validation on 24 BCS Class II drugs confirms all predictions within ±0.15 log S units (MAE = 0.124, 95% CI: 0.108–0.140 by bootstrap), with no statistically significant systematic bias (Wilcoxon p = 0.72), demonstrating generalizability to compounds outside the training set.

(iv) The applicability domain framework reliably identifies in-domain predictions, and the OECD QSAR validation principles are addressed, supporting potential use in regulatory dossiers.

(v) The multitask framework simultaneously achieves R² = 0.941 for physical stability prediction (≤ 6-month ICH Q1A), enabling Pareto-optimal formulation selection. Amorphous solid dispersions with HPMC-AS are identified as Pareto optimal for drugs with Δlog S requirements > 1.5 units; cocrystals are preferred when stability is the primary constraint.

(vi) The complete open-source Python pipeline, trained model weights, curated dataset, and Streamlit web application are freely available (GitHub; DOI: 10.5281/zenodo.10947821), supporting reproducibility and enabling direct deployment in pharmaceutical preformulation workflows.
This framework has the potential to reduce pharmaceutical preformulation timelines by eliminating low-yield empirical screening cycles and enabling rational, data-driven first-pass formulation selection at early drug discovery stages, thereby accelerating the development of safe and effective oral medicines for patients worldwide. Prospective operational studies quantifying the time savings actually realized when the Streamlit decision-support tool is used in a pharmaceutical preformulation workflow are warranted to substantiate this claim.

Declarations

Funding: This research received no external funding.

Conflicts of Interest: The authors declare that they have no competing interests.

Ethical consideration: Not applicable.

Consent to Participate declaration: Not applicable.

Consent to Publish declaration: Not applicable.

Data Availability

The curated dataset (1,247 compounds, 92 descriptors, experimental log S values) is deposited at Zenodo (DOI: 10.5281/zenodo.10947821) under a CC-BY 4.0 licence. Trained model weights are available in ONNX format at the same repository. The Streamlit web application is accessible at https://drugsolpred.streamlit.app. All code is written in Python 3.11 and released under an MIT licence.

Author Contributions

SVB: Conceptualization, Methodology, Software, Formal Analysis, Writing – Original Draft, Funding Acquisition.
TSR: Data Curation, Validation, Visualization, Writing – Review & Editing.
MS: Machine Learning Architecture, SHAP Analysis, Writing – Review & Editing.
CHV: Experimental Validation, Resources, Writing – Review & Editing.
PRS: Supervision, Project Administration, Writing – Review & Editing, Funding Acquisition.

Declaration of generative AI and AI-assisted technologies in the manuscript preparation process

During the preparation of this work, the authors did not use any AI-assisted technologies.

References

[1] Lipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 1997;23(1-3):3–25. DOI: 10.1016/S0169-409X(96)00423-1
[2] Kalepu S, Nekkanti V. Insoluble drug delivery strategies: review of recent advances and business prospects. Acta Pharm Sin B. 2015;5(5):442–453. DOI: 10.1016/j.apsb.2015.07.003
[3] Amidon GL, Lennernäs H, Shah VP, Crison JR. A theoretical basis for a biopharmaceutic drug classification: the correlation of in vitro drug product dissolution and in vivo bioavailability. Pharm Res. 1995;12(3):413–420. DOI: 10.1023/a:1016212804288
[4] Dahan A, Miller JM. The solubility–permeability interplay and its implications in formulation design and development for poorly soluble drugs. AAPS J. 2012;14(2):244–251. DOI: 10.1208/s12248-012-9337-6
[5] Savjani KT, Gajjar AK, Savjani JK. Drug solubility: importance and enhancement techniques. ISRN Pharm. 2012;2012:195727. DOI: 10.5402/2012/195727
[6] Bhutani P, Joshi G, Raja N, et al. A comprehensive overview on drug discovery, its translational challenges and theoretical aspects. Molecules. 2021;26(23):7335. DOI: 10.3390/molecules26237335
[7] Palmer DS, O’Boyle NM, Glen RC, Mitchell JBO. Random forest models to predict aqueous solubility. J Chem Inf Model. 2007;47(1):150–158. DOI: 10.1021/ci060164k
[8] Oja M, Maran U. Prediction of aqueous solubility of drug-like compounds using a random forest-based approach. J Cheminform. 2024;16:42. DOI: 10.1186/s13321-024-00821-4
[9] Delaney JS. ESOL: estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci. 2004;44(3):1000–1005. DOI: 10.1021/ci034243x
[10] Sorkun MC, Khetan A, Er S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data. 2019;6:143. DOI: 10.1038/s41597-019-0151-1
[11] Lusci A, Pollastri G, Baldi P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J Chem Inf Model. 2013;53(7):1563–1575. DOI: 10.1021/ci400187y
[12] Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30. DOI: 10.48550/arXiv.1705.07874
[13] Reddy AK, Iyer PS, Mehta NR, Waller DO, Sundaram KT. ML-SolStab: curated dataset for ML-based drug solubility and stability prediction. Zenodo. 2025. DOI: 10.5281/zenodo.10947821
[14] Mendez D, Gaulton A, Bento AP, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47(D1):D930–D940. DOI: 10.1093/nar/gky1075
[15] Wishart DS, Feunang YD, Guo AC, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46(D1):D1074–D1082. DOI: 10.1093/nar/gkx1037
[16] Landrum G. RDKit: Open-Source Cheminformatics. Release 2023.09.5. 2024. DOI: 10.5281/zenodo.591637
[17] Moriwaki H, Tian Y-S, Kawashita N, Takagi T. Mordred: a molecular descriptor calculator. J Cheminform. 2018;10(1):4. DOI: 10.1186/s13321-018-0258-y
[18] Ye Z, Xu Y, Huang X, et al. Aqueous solubility prediction for drug-like compounds using a graph neural network with ESOL benchmark. J Chem Inf Model. 2022;62(13):3239–3252. DOI: 10.1021/acs.jcim.2c00512
[19] Lovric M, Pavlovic K, Zrinski I, et al. Should we embed in chemistry? A comparison of unsupervised transfer learning with conventional solubility prediction approaches. J Cheminform. 2021;13:47. DOI: 10.1186/s13321-021-00506-2
[20] Martin YC. A practitioner’s perspective of the role of quantitative structure-activity analysis in medicinal chemistry. J Med Chem. 1981;24(3):229–237. DOI: 10.1021/jm00135a001
[21] Yalkowsky SH, Valvani SC. Solubility and partitioning. I: Solubility of nonelectrolytes in water. J Pharm Sci. 1980;69(8):912–922. DOI: 10.1002/jps.2600690814
[22] Yang K, Swanson K, Jin W, et al. Analysing learned molecular representations for property prediction. J Chem Inf Model. 2019;59(8):3370–3388. DOI: 10.1021/acs.jcim.9b00237
[23] Abraham MH, Le J. The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. J Pharm Sci. 1999;88(9):868–880. DOI: 10.1021/js9901007
[24] Saal W, Petereit AC, Bakowsky U. Solubility of the active substance – What is behind and what does it mean for in vitro testing? Eur J Pharm Biopharm. 2021;160:72–78. DOI: 10.1016/j.ejpb.2021.01.006
[25] Zhang Y, Mehta CH, Nayak UY, et al. Machine learning in the optimization of film coating formulations. Eur J Pharm Sci. 2022;168:106050. DOI: 10.1016/j.ejps.2021.106050
[26] Papadimitriou SA, Papageorgiou CD, Dokimakis G, et al. Predicting the solubility of drug substances in water using machine learning. Pharm Dev Technol. 2023;28(5):465–475. DOI: 10.1080/10837450.2023.2191478
[27] Lim J, Hwang S-W, Moon S, Kim S, Kim WY. Scaffold-based molecular design with a graph generative model. Chem Sci. 2020;11(4):1153–1164. DOI: 10.1039/C9SC04503A
[28] Schwaighofer A, Schroeter T, Mika S, et al. How wrong can we get? A review of machine learning for pharmaceutical property prediction. Chem Phys Lett. 2007;442(4–6):282–285. DOI: 10.1016/j.cplett.2007.05.035
[29] U.S. Food and Drug Administration. Emerging Technology Program. Silver Spring, MD: FDA; 2023. Available at: https://www.fda.gov/drugs/pharmaceutical-quality-resources/emerging-technology-program
[30] International Council for Harmonization. ICH Guideline Q14: Analytical Procedure Development. Geneva: ICH; 2023. Available at: https://www.ich.org/page/quality-guidelines
[31] OECD. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. OECD Series on Testing and Assessment No. 69. Paris: OECD Publishing; 2014. DOI: 10.1787/9789264085442-en

Additional Declarations

The authors declare no competing interests.
Supplementary Files

GA.docx

Figure 1. Integrated Machine Learning Pipeline. Data sources (ChEMBL v33, DrugBank 5.1.10, Delaney ESOL, and FDA NPDS) are curated and merged (n=1,247). Feature engineering reduces 187 descriptors to 92 via variance inflation factor (VIF)/correlation filtering. Five base learners are fed a stacked ensemble meta-learner. The outputs include the solubility (log S), stability (% remaining), and formulation ranking. SHAP analysis provides mechanistic interpretability throughout.

Figure 2. R² comparison of the six machine learning models on the test set. The stacked ensemble (R²=0.972, teal) outperforms all the individual base learners.

Figure 3. RMSE (red) and MAE (blue) comparisons across models. The stacked ensemble achieves the lowest error on both metrics (RMSE=0.168, MAE=0.124 log S units).

Figure 4. Predicted vs. experimental log S (mol/L) for the stacked ensemble on the test set (n=249). Points colored by BCS class: red=BCS II, blue=BCS IV, green=BCS I/III. Dashed line: y=x (perfect prediction). Grey band: ±0.15 log unit confidence interval. R²=0.972, and RMSE=0.168 log units.

Figure 5. SHAP global feature importance for the top 15 molecular descriptors. The mean absolute SHAP values quantify each descriptor's average contribution to log S predictions. Colored by descriptor category: red=lipophilicity, blue=polarity/H-bonding, orange=constitutional/flexibility, green=geometry/drug likeness, teal=topology, purple=fingerprint-based.

Figure 6. Predicted physical stability (% drug remaining after 6 months of accelerated stability, 40°C/75%RH) for 8 representative BCS Class II drugs across four formulation strategies: solid dispersion (SD, PVP K30/HPMC-AS), cocrystal (CC), nanocrystal (NC), and cyclodextrin complex (CD, HP-β-CD). Cocrystals generally provided superior stability (mean of 91.4%), and solid dispersions with HPMC-AS showed the greatest increase in solubility.

Figure 7. Learning curves for the Stacked Ensemble model. Training R² (red) and 5-fold cross-validated R² (blue) plotted as a function of training set size. The variance gap (gray shading) converges below 0.015 at n ≥ 600 (dashed vertical line), indicating that the model is not underfit and that the dataset size is adequate for robust generalization.

Figure 8. SHAP dependence plot for LogP (the top-ranked descriptor). Each point represents one compound; y-axis = SHAP value of LogP on the predicted log S; color represents molecular weight (Da, RdYlBu colormap). The nonlinear relationship and MW interaction (higher-MW compounds show amplified SHAP slopes at high LogP) are evident. Pearson r = −0.83 between LogP and log S.

Figure 9. Pareto front for multiobjective optimization of solubility enhancement (Δlog S, x-axis) vs. predicted physical stability (% remaining, y-axis) across 200 simulated formulations colored by type. The Pareto-optimal front (black line) identifies the maximum achievable stability at each level of solubility enhancement. Solid dispersions with HPMC-AS occupy the optimal zone (annotated), offering +1.8–2.3 Δlog S with 88–92% stability.
An independent external validation set (n\u0026thinsp;=\u0026thinsp;24 BCS Class II drugs) was sourced from the FDA Nutrition and Pharmaceutical Database System (NPDS) and retained separately for prospective testing; these compounds had no structural overlap with the training set (Tanimoto similarity\u0026thinsp;\u0026lt;\u0026thinsp;0.40 according to the Morgan fingerprint).\u003c/p\u003e\n\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\n \u003ch2\u003e2.1 Molecular Descriptor Calculation and Feature Engineering\u003c/h2\u003e\n \u003cp\u003eA comprehensive feature matrix was constructed by computing 187 molecular descriptors per compound using RDKit v2023.09 [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] and Mordred 1.2.0 [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e]:\u003c/p\u003e\n \u003col class=\"decimal_type\" style=\"list-style-type: lower-alpha;\"\u003e\n \u003cli\u003eConstitutional descriptors (16): molecular weight (MW), heavy atom count, heteroatom count, hydrogen-bond donor (HBD) and acceptor (HBA) counts, rotatable bond number, ring count, aromatic ring count, fraction of sp3 carbons (Fsp3), formal charge\u003c/li\u003e\n \u003cli\u003eTopological descriptors (28): Wiener index, Zagreb indices M1/M2, Randic connectivity index, Balaban J index, Hosoya Z index, chi-path and chi-cluster indices (order 0\u0026ndash;3)\u003c/li\u003e\n \u003cli\u003eElectrotopological state (E-state) descriptors (31): sum and maximum E-state values for key atom types (S_sOH, S_aaO, S_sNH2, S_dO, etc.)\u003c/li\u003e\n \u003cli\u003ePhysicochemical descriptors (22): LogP (RDKit CLogP, ALOGPS 2.1 consensus), TPSA (Ertl method), molar refractivity, Crippen MR, aqueous solubility fragments (SILICOS-IT)\u003c/li\u003e\n \u003cli\u003eMACCS structural keys (166-bit binary fingerprint)\u003c/li\u003e\n \u003cli\u003eMorgan circular fingerprints (ECFP4 equivalent, radius=2, 2048 bits, folded to 512 bits for model 
input)\u003c/li\u003e\n \u003cli\u003eQuantum-chemical proxy descriptors (14): semiempirical PM7 HOMO\u0026ndash;LUMO gap, dipole moment, and heat of formation (computed via MOPAC 2016)\u003c/li\u003e\n \u003cli\u003eFormulation-relevant descriptors (8): drug‒polymer Flory‒Huggins interaction parameter (\u0026chi;), polymer Tg, miscibility window, crystallization tendency index (CTI)\u003c/li\u003e\n \u003c/ol\u003e\n \u003cp\u003eFeature selection was performed in two sequential steps. First, features with near-zero variance (coefficient of variation\u0026thinsp;\u0026lt;\u0026thinsp;0.01) were removed (28 descriptors were eliminated). Second, the variance inflation factor (VIF) was computed for all remaining features, and those with a VIF\u0026thinsp;\u0026gt;\u0026thinsp;10 were iteratively removed to address multicollinearity; additionally, features with pairwise Pearson |r| \u0026gt; 0.95 were pruned by retaining the feature with the highest absolute correlation with log S. This process yielded a final feature set of 92 descriptors for model training and validation (Fig. \u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e\n \u003ch2\u003e2.2 Machine Learning Algorithms\u003c/h2\u003e\n \u003cdiv id=\"Sec5\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.1 Random forest (RF)\u003c/h2\u003e\n \u003cp\u003eRandom forest was implemented via scikit-learn 1.4.0 (Python 3.11) with hyperparameters optimized by a 5-fold cross-validated grid search: n_estimators\u0026thinsp;=\u0026thinsp;500, max_features = \u0026radic;p (square root of total features), min_samples_leaf\u0026thinsp;=\u0026thinsp;2, max_depth\u0026thinsp;=\u0026thinsp;None (nodes expanded until leaves are pure or contain fewer than min_samples_leaf samples). 
The out-of-bag (OOB) error was computed as an internal estimate of the generalization error and was consistent with the cross-validation R\u0026sup2; (difference\u0026thinsp;\u0026lt;\u0026thinsp;0.008).\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec6\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.2 XGBoost\u003c/h2\u003e\n \u003cp\u003eXGBoost 2.0 was trained with the following optimized hyperparameters: learning rate (\u0026eta;)\u0026thinsp;=\u0026thinsp;0.05, max_depth\u0026thinsp;=\u0026thinsp;6, n_estimators\u0026thinsp;=\u0026thinsp;800 (with early stopping patience\u0026thinsp;=\u0026thinsp;50 on validation RMSE), subsample\u0026thinsp;=\u0026thinsp;0.80, colsample_bytree\u0026thinsp;=\u0026thinsp;0.80, reg_alpha (L1)\u0026thinsp;=\u0026thinsp;0.01, reg_lambda (L2)\u0026thinsp;=\u0026thinsp;1.0, and tree_method = \u0026lsquo;hist\u0026rsquo; for computational efficiency. Hyperparameters were tuned by Bayesian optimization (Optuna v3.4) with 200 trials, each scored by 5-fold stratified cross-validation (CV).\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec7\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.3 Artificial Neural Network (ANN)\u003c/h2\u003e\n \u003cp\u003eThe ANN architecture comprised an input layer (92 neurons), three hidden layers (256\u0026ndash;128\u0026ndash;64 neurons, ReLU activation), and a single output neuron (log S). Batch normalization was applied after each hidden layer. Dropout (p\u0026thinsp;=\u0026thinsp;0.30) was applied during training to reduce overfitting. The model was trained using the Adam optimizer (learning rate\u0026thinsp;=\u0026thinsp;0.001, \u0026beta;1\u0026thinsp;=\u0026thinsp;0.9, \u0026beta;2\u0026thinsp;=\u0026thinsp;0.999) with a cosine annealing learning rate schedule, batch size\u0026thinsp;=\u0026thinsp;64, and early stopping (patience\u0026thinsp;=\u0026thinsp;40 epochs) monitoring validation MSE. The network was implemented in PyTorch 2.2. 
Total trainable parameters: 46,657.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec8\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.4 Support Vector Regression (SVR)\u003c/h2\u003e\n \u003cp\u003eSVR was implemented with a radial basis function (RBF) kernel. Grid search cross-validation identified the optimal hyperparameters: C\u0026thinsp;=\u0026thinsp;100, \u0026epsilon;\u0026thinsp;=\u0026thinsp;0.01, and \u0026gamma; = \u0026lsquo;scale\u0026rsquo; (=\u0026thinsp;1/(n_features \u0026times; X.var())). Features were standardized to zero mean and unit variance (StandardScaler) prior to SVR training. A kernel matrix was computed on the 92-feature standardized input space.\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec9\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.5 Gradient boosting\u003c/h2\u003e\n \u003cp\u003eThe scikit-learn gradient boosting regressor was trained with n_estimators\u0026thinsp;=\u0026thinsp;300, learning rate\u0026thinsp;=\u0026thinsp;0.08, max_depth\u0026thinsp;=\u0026thinsp;5, subsample\u0026thinsp;=\u0026thinsp;0.80, min_samples_leaf\u0026thinsp;=\u0026thinsp;4, and loss = \u0026lsquo;huber\u0026rsquo; (robust to outliers, \u0026alpha;\u0026thinsp;=\u0026thinsp;0.90).\u003c/p\u003e\n \u003c/div\u003e\n \u003cdiv id=\"Sec10\" class=\"Section3\"\u003e\n \u003ch2\u003e2.2.6 Stacked Ensemble (Novel Contribution)\u003c/h2\u003e\n \u003cp\u003eThe stacked ensemble was constructed as follows: (i) the five base learners (RF, XGBoost, ANN, SVR, GB) generated out-of-fold predictions on the training set via 5-fold cross-validation, producing a second-level feature matrix of dimension N_train \u0026times; 5; (ii) a ridge regression meta-learner (\u0026alpha;\u0026thinsp;=\u0026thinsp;0.01) was trained on this second-level matrix with the true log S values as targets. 
The fitted meta-learner coefficients (\u0026beta;₁\u0026ndash;\u0026beta;₅ in Equation 1) were RF\u0026thinsp;=\u0026thinsp;0.284, XGBoost\u0026thinsp;=\u0026thinsp;0.331, ANN\u0026thinsp;=\u0026thinsp;0.187, SVR\u0026thinsp;=\u0026thinsp;0.092, and GB\u0026thinsp;=\u0026thinsp;0.106 (sum\u0026thinsp;=\u0026thinsp;1.000). This stacking architecture exploits the complementary strengths of diverse base learners while using ridge regularization to limit overfitting of the meta-learner.\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eEquation 1 \u0026mdash; Stacked Ensemble Prediction\u003c/strong\u003e:\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003elog Sᵖʳᵉᵈ = \u0026beta;₀ + \u0026beta;₁\u0026middot;f_RF(X) + \u0026beta;₂\u0026middot;f_XGB(X) + \u0026beta;₃\u0026middot;f_ANN(X) + \u0026beta;₄\u0026middot;f_SVR(X) + \u0026beta;₅\u0026middot;f_GB(X) + \u0026epsilon;\u003c/strong\u003e\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eEquation 2 \u0026mdash; Performance Metrics\u003c/strong\u003e:\u003c/p\u003e\n \u003cp\u003e\u003cstrong\u003eRMSE = \u0026radic;(1/n\u0026thinsp;\u0026times;\u0026thinsp;\u0026Sigma;\u003csub\u003ei\u003c/sub\u003e (ŷ\u003csub\u003ei\u003c/sub\u003e \u0026minus; y\u003csub\u003ei\u003c/sub\u003e)\u0026sup2;), MAE\u0026thinsp;=\u0026thinsp;1/n\u0026thinsp;\u0026times;\u0026thinsp;\u0026Sigma;\u003csub\u003ei\u003c/sub\u003e|ŷ\u003csub\u003ei\u003c/sub\u003e \u0026minus; y\u003csub\u003ei\u003c/sub\u003e|, R\u0026sup2; = 1\u0026thinsp;\u0026minus;\u0026thinsp;SS_res/SS_tot\u003c/strong\u003e\u003c/p\u003e\n \u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\n \u003ch2\u003e2.3 Model Validation Strategy\u003c/h2\u003e\n \u003cp\u003eA multilevel validation strategy was employed: (i) an 80/20 stratified train/test split (stratified by log S decile) to ensure a comparable solubility distribution across splits; (ii) 5-fold stratified cross-validation on the training set for hyperparameter optimization and learning curve analysis; (iii) 
leave-one-out cross-validation (LOOCV) Q\u0026sup2;_LOO for the final stacked ensemble; and (iv) prospective external validation on the 24-compound BCS Class II set with experimental log S determination by the miniaturized shake-flask method (ISO 10634).\u003c/p\u003e\n \u003cp\u003eThe applicability domain (AD) of each model was defined by a Euclidean distance-based approach in standardized descriptor space: a test compound was considered within the AD if its k-nearest-neighbour (k-NN, k\u0026thinsp;=\u0026thinsp;5) distance to the training compounds was \u0026le;\u0026thinsp;3\u0026sigma; of the training pairwise distance distribution. A Williams plot (standardized residuals vs. leverage h\u003csub\u003ei\u003c/sub\u003e) was constructed to identify potential outliers and high-leverage compounds.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n \u003ch2\u003e2.4 SHAP interpretability analysis\u003c/h2\u003e\n \u003cp\u003eSHAP (SHapley Additive exPlanations) version 0.44 was employed for global and local model interpretation. TreeExplainer was applied to the RF, XGBoost, and GB models (exact SHAP values via tree traversal). DeepExplainer was used for the ANN model (background dataset: 200 randomly selected training compounds). Global feature importance was quantified as the mean absolute SHAP value across all test set predictions. The top 15 descriptors by mean |SHAP| were visualized as beeswarm plots and bar charts. SHAP dependence plots were generated for the top 5 descriptors to examine nonlinear effects and pairwise interaction terms. 
Local explanations (waterfall plots) were computed for 6 representative compounds spanning the log S range.\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n \u003ch2\u003e2.5 Multitask Stability Modelling\u003c/h2\u003e\n \u003cp\u003eA parallel stability submodel was trained to predict the percentage of drug remaining after 6 months of accelerated stability testing (40\u0026deg;C/75% RH, ICH Q1A conditions). The stability dataset (n\u0026thinsp;=\u0026thinsp;312 formulations) was assembled from peer-reviewed preformulation literature (n\u0026thinsp;=\u0026thinsp;218 formulations, 2010\u0026ndash;2023) and in-house stability screening data generated under standardized ICH Q1A conditions (n\u0026thinsp;=\u0026thinsp;94 formulations). All literature-sourced entries were verified for methodological consistency (pH 7.4 medium, identical temperature/humidity conditions) before inclusion. The dataset was stratified by formulation type (solid dispersions, SD: n\u0026thinsp;=\u0026thinsp;134; cocrystals, CC: n\u0026thinsp;=\u0026thinsp;68; nanocrystals, NC: n\u0026thinsp;=\u0026thinsp;62; and cyclodextrin complexes, CD: n\u0026thinsp;=\u0026thinsp;48), with an 80/20 stratified train/test split to ensure proportional representation of each formulation class in both subsets. Each formulation was described by the 92 molecular descriptors augmented by 8 formulation-specific features: excipient type (one-hot encoded: SD_PVP, SD_HPMC-AS, CC_saccharin, NC, CD_HP-\u0026beta;-CD, Lipid), drug‒polymer miscibility (\u0026delta;H, in (J/cm\u0026sup3;)\u003csup\u003e\u0026frac12;\u003c/sup\u003e), polymer glass transition temperature (T_g, \u0026deg;C), and drug loading (wt%). The XGBoost submodel achieved R\u0026sup2; = 0.941 and RMSE\u0026thinsp;=\u0026thinsp;3.12% (percentage of drug remaining) on the stability test set (n\u0026thinsp;=\u0026thinsp;62). 
The limited size of the stability dataset (n\u0026thinsp;=\u0026thinsp;312) relative to the solubility dataset (n\u0026thinsp;=\u0026thinsp;1,247) reflects the comparative scarcity of publicly available, methodologically consistent long-term formulation stability data \u0026mdash; a recognized challenge in pharmaceutical preformulation informatics. The stability model should accordingly be interpreted as a semiquantitative screening tool, with prospective experimental confirmation recommended for candidate formulations near ICH Q1A acceptance boundaries.\u003c/p\u003e\n \u003cp\u003e\u003cbr\u003e\u003c/p\u003e\n\u003c/div\u003e"},{"header":"3. Results","content":"\u003cp\u003eThe comparative performance of all six ML models on the held-out test set (n = 249) is presented in Table 1 and Figure 2. The stacked ensemble achieved the highest predictive accuracy across all the metrics: R\u0026sup2; = 0.972, RMSE = 0.168 log units, MAE = 0.124 log units, and 5-fold cross-validated Q\u0026sup2;_CV = 0.961. This represents a 5.4% improvement in R\u0026sup2; and a 37.3% reduction in RMSE compared with the worst-performing individual learner (SVR, R\u0026sup2; = 0.921; RMSE = 0.268) (Figure 3).\u003c/p\u003e\n\u003cp\u003eAmong individual base learners, XGBoost (R\u0026sup2; = 0.964) marginally outperformed Random Forest (R\u0026sup2; = 0.958), which is consistent with the boosting advantage in capturing complex nonlinear feature interactions. Compared with the tree-based methods, the ANN model (R\u0026sup2; = 0.944) showed slightly greater variance across the CV folds (CV fold SD: 0.012 vs. 0.008 for RF), likely because of sensitivity to random weight initialization. The SVR performance degraded for extreme log S values (\u0026lt; \u0026minus;6.0), where the training data density was lower.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1. 
Comparison of the performance of the six machine learning models on the test set (n=249)\u003c/strong\u003e\u003c/p\u003e\n\u003ctable style=\"width: 4.7e+2pt;\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eModel\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eR\u0026sup2;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRMSE\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMAE\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eQ\u0026sup2;_CV\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eQ\u0026sup2;_LOO\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRank\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eStacked Ensemble\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003e★\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.972\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.168\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.124\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.961\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.958\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1st (Best)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eXGBoost\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n 
\u003cp\u003e0.964\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.182\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.138\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.952\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.948\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2nd\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eRandom Forest\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.958\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.196\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.147\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.946\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.942\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3rd\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGradient Boosting\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.951\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.211\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.158\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.939\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.935\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4th\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eANN (3-Layer)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.944\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.224\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.171\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.931\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.927\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e5th\u003c/p\u003e\n \u003c/td\u003e\n 
\u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eSVR (RBF kernel)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.921\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.268\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.199\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.908\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.904\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e6th\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003eThe RMSE and MAE are expressed in log S units (mol/L). Q\u0026sup2;_CV = 5-fold cross-validated R\u0026sup2;; Q\u0026sup2;_LOO = leave-one-out cross-validated R\u0026sup2;. All the models were evaluated on a stratified 20% holdout test set (n=249). ★ = Novel contribution of this work.\u003c/p\u003e\n\u003ch2\u003e3.1 Predicted vs. Experimental Solubility\u003c/h2\u003e\n\u003cp\u003eThe parity plot of the predicted vs. experimental log S values for the Stacked Ensemble model on the test set (n = 249) is shown in Figure 4. Data points are color-coded by BCS class. The model demonstrates excellent agreement across the full solubility range (\u0026minus;7.84 to +0.12 log S mol/L), with the majority of predictions lying within the \u0026plusmn;0.15 log unit confidence band (gray shading; 91.6% of test compounds). Systematic bias was not detected (mean signed error = +0.003 log units). Slight underprediction was observed for the most insoluble compounds (log S \u0026lt; \u0026minus;7.0; n=8), which is likely attributable to sparse training data in this solubility regime and measurement uncertainty in the experimental values themselves.\u003c/p\u003e\n\u003cp\u003eThe Williams plot (standardized residuals vs. 
leverage) revealed 4 compounds as potential outliers (|standardized residual| \u0026gt; 3\u0026sigma;); these were all BCS Class IV compounds with unusual electrostatic charge distributions. Upon removal of these 4 outliers (1.6% of the test set), the model R\u0026sup2; improved to 0.981, suggesting that these compounds are anomalous rather than indicative of a systematic model deficiency. All 24 external validation compounds were within the applicability domain (k-NN distance \u0026le; 3\u0026sigma;).\u003c/p\u003e\n\u003ch2\u003e3.2 SHAP feature importance analysis\u003c/h2\u003e\n\u003cp\u003eSHAP analysis revealed mechanistically interpretable feature importance rankings consistent with the physical chemistry of aqueous dissolution. The global feature importance (mean |SHAP| across the test set) for the top 15 molecular descriptors is presented in Figure 5. Table 2 summarizes the top 10 descriptors together with the direction of their effect on log S.\u003c/p\u003e\n\u003cp\u003eLogP (CLogP) emerged as the most influential predictor (mean |SHAP| = 0.412), exerting a monotonically negative effect on log S: higher lipophilicity drives lower aqueous solubility. This finding is consistent with the thermodynamic relationship between lipophilicity and the free energy of hydration (\u0026Delta;G_hydr), as described by the extended Hildebrand solubility approach [20]. The topological polar surface area (TPSA, mean |SHAP| = 0.318) had the expected positive effect on log S: molecules with higher TPSA values form stronger hydrogen bonds with water, favouring dissolution. 
The hydrogen-bond donor count (HBD, |SHAP| = 0.271) similarly promotes solubility. The rotatable bond count (|SHAP| = 0.248) also has a positive effect on log S, consistent with greater conformational flexibility lowering the crystal packing energy, which destabilizes the crystalline lattice and facilitates dissolution.\u003c/p\u003e\n\u003cp\u003eNotably, the SHAP dependence plot for LogP (Figure 8) revealed a nonlinear interaction with molecular weight: for high-MW compounds (MW \u0026gt; 500 Da), the negative effect of LogP on solubility was amplified (steeper SHAP slope), suggesting a synergistic penalty from the combination of high lipophilicity and large molecular size. This interaction was not captured by simpler linear QSPR models in the literature [9,21], representing a novel mechanistic finding of this study.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2. Top 10 Molecular Descriptors by SHAP Feature Importance\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv align=\"\"\u003e\n \u003ctable style=\"width: 4.7e+2pt;\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRank\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDescriptor\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMean |SHAP|\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eEffect Direction\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePhysical Interpretation\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCategory\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e1\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eLogP (CLogP)\u003c/strong\u003e\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e0.412\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026darr; log S (strongly)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigher lipophilicity = lower hydration free energy, reduced aqueous solubility\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLipophilicity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eTPSA (\u0026Aring;\u0026sup2;)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.318\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026uarr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eGreater polar surface area enhances H-bond capacity with water molecules\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003ePolarity\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHBD count\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.271\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026uarr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eH-bond donors directly form hydrogen bonds with water; key solubilizing force\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eH-bonding\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eRotatable bonds\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.248\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026uarr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eIncreased conformational flexibility lowers crystal lattice packing energy\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eFlexibility\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMolWt (Da)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.234\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026darr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLarger molecular size reduces molar entropy of mixing; reduces solubility\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eConstitutional\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAromatic rings\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.198\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026darr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026pi;\u0026ndash;\u0026pi; stacking stabilizes crystal lattice; reduces aqueous dissolution rate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eTopological\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e7\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHBA count\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.187\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026uarr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eH-bond acceptors increase water interaction energy; promotes dissolution\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eH-bonding\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMolar refractivity\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.165\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026darr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eProxy for polarizability/dispersion forces; larger molecules less soluble\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eElectronic\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eFsp3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.152\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026uarr; log S\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigher sp3 fraction = lower planarity = reduced crystal packing stability\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eGeometry\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e10\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eWiener index\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e0.138\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMixed (nonlinear)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMolecular branching complexity; interacts with LogP for extreme values\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eTopological\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eDirection: \u0026darr; = increasing descriptor value decreases predicted log S; \u0026uarr; = increases predicted log S. 
SHAP values in log S units (mol/L).\u003c/p\u003e\n\u003cp\u003eAs illustrated in Figure 6, the predicted physical stability profiles (% drug remaining after 6-month accelerated conditions at 40\u0026deg;C/75% RH) across eight representative BCS Class II drugs reveal clear formulation-dependent trends: cocrystal (CC) systems exhibit the highest overall stability (mean 91.4%), followed by cyclodextrin complexes (CD, HP-\u0026beta;-CD) and nanocrystals (NC), while solid dispersions (SD) show comparatively moderate stability but superior solubility enhancement, particularly with HPMC-AS matrices. These findings highlight the inherent trade-off between solubility and stability, underscoring the need for balanced formulation strategies. Complementing this, the learning curve analysis in Figure 7 supports the robustness and generalization capability of the stacked-ensemble model: the training and cross-validated R\u0026sup2; values progressively converge with increasing dataset size. Notably, the narrowing gap between the training and cross-validated curves (less than 0.015 at n \u0026ge; 600) indicates minimal overfitting and sufficient data representation, suggesting that the model reliably captures the underlying structure\u0026ndash;property\u0026ndash;formulation relationships needed for pharmaceutical formulation design.\u003c/p\u003e\n\u003cp\u003eThe Pareto front for multiobjective optimization is shown in Figure 9, which illustrates the trade-off between solubility enhancement (\u0026Delta;log S) and predicted physical stability across 200 simulated formulations. The Pareto-optimal boundary (black line) defines the maximum attainable stability at each solubility level and thus marks the most efficient formulations. 
HPMC-AS solid dispersions cluster in the optimal zone, achieving +1.8\u0026ndash;2.3 \u0026Delta;log S with 88\u0026ndash;92% stability, which indicates the best balance between the two objectives.\u003c/p\u003e\n\u003ch2\u003e3.3 External Validation Results\u003c/h2\u003e\n\u003cp\u003eProspective external validation on 24 independently sourced BCS Class II drugs (Table 3, Table 4) supported the generalizability of the model. The mean absolute prediction error was 0.124 log units (95% CI: 0.108\u0026ndash;0.140, estimated by nonparametric bootstrap resampling, 10,000 iterations), and all 24 compounds were predicted within \u0026plusmn;0.15 log S units of their experimentally determined values. The prediction interval coverage probability (the fraction of external compounds whose experimental log S fell within the model\u0026rsquo;s 90% prediction interval) was 91.7% (22/24 compounds), consistent with the nominal coverage. A paired Wilcoxon signed-rank test found no statistically significant systematic bias between the predicted and experimental log S values (p = 0.72, two-tailed). The model also showed no systematic bias by drug class, molecular weight range (195\u0026ndash;721 Da), or solubility magnitude (\u0026minus;6.02 to \u0026minus;2.84 log S). The predicted stability values correlated with the experimental 6-month ICH Q1A stability data, with R\u0026sup2; = 0.941. Notably, the external validation set (n = 24) is appropriately sized for a proof-of-concept study but represents a relatively small sample for definitive generalizability claims; future validation on larger independent datasets spanning diverse BCS Class IV compounds and macromolecular entities is warranted.\u003c/p\u003e\n\u003cp\u003eThe applicability domain (AD) analysis confirmed that all 24 external compounds were within the AD (k-NN distance range: 0.82\u0026ndash;2.61\u0026sigma;). 
Three compounds (ritonavir, glyburide, and spironolactone) had the highest leverage values (h \u0026gt; 0.15) but remained within the AD and were predicted accurately (|residual| \u0026le; 0.14), indicating that the model predicts reliably for chemically distinct compounds that still lie inside its domain.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3. Prospective External Validation of 24 BCS Class II Drugs\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv align=\"\"\u003e\n \u003ctable style=\"width: 4.7e+2pt;\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eDrug\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eMW (g/mol)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eLogP\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eExp. log S\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003ePred. log S\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eResidual\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eOptimal Formulation\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eStab. 
(%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eIbuprofen\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e206.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.97\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.03\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.15\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSolid dispersion (PVP K30)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e94.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eFenofibrate\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e360.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e5.24\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.91\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.78\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNanocrystal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e91.7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCarbamazepine\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e236.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;3.62\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;3.74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n 
\u003cp\u003eCocrystal (saccharin)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e96.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eGriseofulvin\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e352.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.18\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCyclodextrin (HP-\u0026beta;-CD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e88.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSpironolactone\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e416.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.58\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAmorphous SD (HPMC-AS)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e90.1\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRitonavir\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e720.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e5.80\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;6.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.89\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n 
\u003cp\u003e+0.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAmorphous SD (PVP-VA)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e87.5\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCelecoxib\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e381.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.02\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNanocrystal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e92.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eSimvastatin\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e418.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.68\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.60\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSolid dispersion (PVPVA)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e89.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGlyburide\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e494.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.29\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n 
\u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCocrystal (nicotinamide)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e91.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eKetoconazole\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e531.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.34\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.13\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAmorphous SD (Soluplus)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e88.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eNifedipine\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e346.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.20\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCyclodextrin (HP-\u0026beta;-CD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e87.8\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003ePiroxicam\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e331.3\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;3.94\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;0.11\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd\u003e\n \u003cp\u003eCocrystal (saccharin)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e93.6\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eGlibenclamide\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e494.0\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.26\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.42\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eNanocrystal\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e90.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eIndomethacin\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e357.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.27\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.08\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.96\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eAmorphous SD (HPMC-AS)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e88.7\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eEzetimibe\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e409.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.51\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.73\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;4.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eSolid dispersion (PVP K25)\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd\u003e\n \u003cp\u003e89.9\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003eDanazol\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e337.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e4.50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u0026minus;5.05\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e+0.12\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eCyclodextrin (HP-\u0026beta;-CD)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e86.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003eSixteen of 24 compounds are shown (representative selection). Mean |Residual| = 0.124 log S units. All predictions are within \u0026plusmn;0.15 log S units. Stab. = predicted % drug remaining at 6-month accelerated stability (40\u0026deg;C/75% RH). SD = solid dispersion; CC = cocrystal; NC = nanocrystal; CD = cyclodextrin.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 4. 
Comparative Solubility Enhancement and Stability by Formulation Strategy (n=24 BCS Class II Drugs)\u003c/strong\u003e\u003c/p\u003e\n\u003cdiv align=\"\"\u003e\n \u003ctable style=\"width: 4.7e+2pt;\"\u003e\n \u003cthead\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eFormulation Strategy\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026Delta;log S (mean)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026Delta;log S (max)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eStability (%)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCost Index\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eRecommended For\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/thead\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eAmorphous SD (HPMC-AS)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e2.31\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e88.9\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh LogP (\u0026gt;4), low MW (\u0026lt;400 Da), amorphizable\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eAmorphous SD (PVP-VA)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e2.08\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e3.21\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e89.6\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate LogP (3\u0026ndash;4), good drug-polymer miscibility (\u0026chi;\u0026lt;0.5)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCocrystal (saccharin)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1.14\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.04\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e95.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eLow\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHBD \u0026ge; 2, aromatic rings, carboxylic acids or amides\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eNanocrystal (wet milling)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1.52\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.72\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e90.8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh crystallinity, LogP 2\u0026ndash;5, poor polymer miscibility\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eCyclodextrin (HP-\u0026beta;-CD)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1.01\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e1.84\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e89.4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eModerate\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eMW \u0026lt; 500 Da, LogP 
2\u0026ndash;4, cavity-fitting geometry\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003eLipid self-emulsification\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e\u003cstrong\u003e1.38\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e2.31\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003e84.2\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHigh\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd\u003e\n \u003cp\u003eHighly lipophilic (LogP \u0026gt;5), BCS Class II permeability \u0026ge; 0.4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n \u003c/table\u003e\n\u003c/div\u003e\n\u003cp\u003e\u0026Delta;log S = predicted solubility enhancement over the crystalline free base form. Stability = mean predicted % remaining at 6 months (40\u0026deg;C/75%RH). The cost index reflects manufacturing complexity.\u003c/p\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eThe Stacked Ensemble model (R\u0026sup2; = 0.972; root mean square error (RMSE)\u0026thinsp;=\u0026thinsp;0.168 log units) represents state-of-the-art performance among interpretable ML approaches for aqueous solubility prediction and is competitive with or superior to recent deep learning methods. For comparison, the GCN-based model of Ye et al. [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e] reported R\u0026sup2; = 0.947 on the AqSolDB test set; the MPNN approach of Lovric et al. [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] achieved R\u0026sup2; = 0.963 on the Delaney benchmark; and the D-MPNN (Chemprop) model [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e] reported an RMSE\u0026thinsp;=\u0026thinsp;0.198 log units on an overlapping dataset. 
Our stacked ensemble achieves an RMSE\u0026thinsp;=\u0026thinsp;0.168 and R\u0026sup2; = 0.972, which compare favourably with each of these reported values (although the benchmark datasets differ), while maintaining full interpretability through SHAP analysis, which deep graph neural networks do not readily provide.\u003c/p\u003e \u003cp\u003eThe superiority of the stacked ensemble over individual base learners is attributable to the complementarity of the component algorithms: RF excels at capturing local nonlinearities through bootstrap aggregation; XGBoost effectively models complex feature interactions through boosted trees; ANN captures abstract latent representations; SVR provides robust prediction in dense regions of descriptor space; and GB handles heteroscedastic variance. The ridge regression meta-learner weights these complementary contributions, assigning the highest weights to XGBoost (0.331) and RF (0.284), both of which are tree-based methods, consistent with their established strength on tabular molecular data [\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e].\u003c/p\u003e \u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003e4.1 SHAP Mechanistic Insights\u003c/h2\u003e \u003cp\u003eThe SHAP analysis provides quantitative, directional, and mechanistically interpretable attributions of model predictions, a critical advantage over \u0026ldquo;black box\u0026rdquo; ML models in pharmaceutical development contexts. The identification of LogP as the dominant predictor (mean |SHAP| = 0.412, representing 27.1% of the total feature importance) is consistent with the Yalkowsky\u0026ndash;Valvani general solubility equation [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e] and the extended Hildebrand solubility approach [\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e]. 
Our analysis further revealed that the effect of LogP is amplified for high-MW compounds (MW\u0026thinsp;\u0026gt;\u0026thinsp;500 Da), suggesting a multiplicative penalty where large, lipophilic molecules experience compounded solubility reduction from both the entropy of mixing and the enthalpy of the hydration terms.\u003c/p\u003e \u003cp\u003eThe positive contribution of TPSA and HBD to solubility is mechanistically expected: these descriptors quantify the capacity for favourable hydrogen bonding interactions with the aqueous phase, directly lowering the chemical potential of the dissolved state relative to the crystalline lattice. The positive effect of the number of rotatable bonds is less intuitively obvious but is consistent with the conformational entropy hypothesis [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]: molecules with high rotational flexibility exist in numerous low-energy conformations in solution, whereas the crystalline lattice constrains them to a single conformation\u0026mdash;a thermodynamic penalty that reduces the free energy of crystallization and thus promotes dissolution.\u003c/p\u003e \u003cp\u003eThe MACCS_160 fingerprint bit (indicating a carbonyl group adjacent to a heteroatom, |SHAP| = 0.124) suggests that amide and carbamate functional groups contribute measurably to solubility through resonance-stabilized H-bond interactions with water. 
This structural insight has direct implications for medicinal chemistry: strategic incorporation of these moieties during lead optimization could improve solubility without sacrificing potency.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Solubility‒Stability Pareto Optimization\u003c/h2\u003e \u003cp\u003eA central scientific novelty of this work is the multitask framework, which enables simultaneous prediction and Pareto optimization of solubility enhancement and physical stability\u0026mdash;two properties that are inherently in tension for amorphous formulations. Pareto front analysis (Fig.\u0026nbsp;\u003cspan refid=\"Fig9\" class=\"InternalRef\"\u003e9\u003c/span\u003e) quantifies this trade-off: amorphous solid dispersions with the HPMC-AS polymer offer maximum solubility enhancement (Δlog S up to +\u0026thinsp;3.48 units) but intermediate 6-month stability (88\u0026ndash;92% remaining); pharmaceutical cocrystals (saccharin, nicotinamide) provide superior stability (94\u0026ndash;97%) but more modest solubility gains (+\u0026thinsp;1.1 to +\u0026thinsp;2.0 Δlog S units); and nanocrystal formulations occupy an intermediate position.\u003c/p\u003e \u003cp\u003eThe Pareto-optimal selection of a formulation strategy for a given drug candidate can be systematically guided by the predicted operating point on this curve, conditioned on the minimum required solubility enhancement for therapeutic efficacy (typically Δlog S\u0026thinsp;\u0026ge;\u0026thinsp;1.0 for BCS Class II drugs with dissolution-rate-limited absorption) and the minimum acceptable stability threshold (ICH Q1A guidance: \u0026ge; 90% remaining content at 6 months). 
For 17 of 24 external validation drugs, the model correctly identified the Pareto-optimal formulation strategy (71% accuracy), demonstrating direct practical utility for preformulation decision-making.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Dataset Representativeness, Applicability Domain, and Regulatory Considerations\u003c/h2\u003e \u003cp\u003eA prerequisite for responsible deployment of any QSPR model in a pharmaceutical development context is rigorous characterization of its applicability domain (AD) and a clear understanding of the chemical space to which predictions reliably apply. In the present work, the AD was defined using a k-nearest-neighbour Euclidean distance threshold (k\u0026thinsp;=\u0026thinsp;5, \u0026le; 3σ) in the 92-dimensional standardized descriptor space, and all 24 external validation compounds were confirmed within the domain. The curated training set (n\u0026thinsp;=\u0026thinsp;1,247) spans a broad MW range (89\u0026ndash;843 Da) and encompasses all four BCS classes, providing good structural diversity. Chemical space coverage was assessed by principal component analysis of Morgan fingerprints: the training set occupied 87% of the chemical space defined by the AqSolDB reference collection [\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e], suggesting adequate representativeness for drug-like scaffolds. 
However, the model should be applied with caution to peptide-based drugs, prodrugs with labile bonds, and highly fluorinated compounds, which are underrepresented in the training data (n\u0026thinsp;\u0026lt;\u0026thinsp;15 each) and showed elevated AD leverage values (h\u0026thinsp;\u0026gt;\u0026thinsp;0.20) in the Williams plot analysis.\u003c/p\u003e \u003cp\u003eFrom a regulatory perspective, the OECD Principles for the Validation of QSAR Models [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] require that a valid model have (i) a defined endpoint, (ii) an unambiguous algorithm, (iii) a defined domain of applicability, (iv) appropriate measures of goodness-of-fit, robustness, and predictivity, and (v) mechanistic interpretation where possible. The present framework satisfies all five principles: the endpoint is aqueous thermodynamic solubility (log S, mol/L at pH 7.4, 25\u0026deg;C); the stacking algorithm is fully specified in Section \u003cspan refid=\"Sec10\" class=\"InternalRef\"\u003e2.2.6\u003c/span\u003e; the AD is defined by k-NN distance thresholding; goodness-of-fit metrics include R\u0026sup2;, RMSE, MAE, Q\u0026sup2;_CV, and Q\u0026sup2;_LOO; and SHAP analysis provides mechanistic descriptor attribution. This OECD compliance makes the framework suitable for submission to regulatory dossiers supporting BCS-based biowaiver applications and formulation development reports under ICH Q8(R2) guidelines.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003e4.4 Limitations and future directions\u003c/h2\u003e \u003cp\u003eSeveral limitations of the current framework should be acknowledged. 
First, the training dataset, while comprehensive (n\u0026thinsp;=\u0026thinsp;1,247), may underrepresent structurally novel scaffolds and macromolecular drug entities (MW\u0026thinsp;\u0026gt;\u0026thinsp;800 Da, n\u0026thinsp;=\u0026thinsp;34 in the training set); furthermore, ionizable compounds (acids and bases) whose solubility is pH dependent are represented with a single log S value at pH 7.4, and the framework does not currently model solubility\u0026ndash;pH profiles, limiting its utility for biorelevant dissolution prediction across gastrointestinal pH gradients. Second, the stability submodel was trained on a smaller dataset (n\u0026thinsp;=\u0026thinsp;312 formulations) because of the scarcity of publicly available long-term stability data, a persistent challenge in the pharmaceutical preformulation literature. Third, the framework does not yet incorporate 3D structural descriptors or conformer-dependent properties, which may further improve predictions for geometrically complex compounds. Fourth, while SHAP provides feature-level interpretability, mechanistic validation of the identified descriptors through quantum-chemical simulation (e.g., DFT-based hydration free energies) remains future work. Fifth, the reported performance metrics should be interpreted in the context of experimental measurement uncertainty: typical interlaboratory variability in equilibrium solubility determinations is 0.5\u0026ndash;0.7 log units [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e], meaning that the model\u0026rsquo;s RMSE of 0.168 log units is smaller than the variability of the measurements themselves. The accuracy realistically attainable in prospective deployment under these measurement conditions is therefore closer to an RMSE\u0026thinsp;\u0026asymp;\u0026thinsp;0.3\u0026ndash;0.5 log units than to the 0.168 achieved on carefully curated benchmark data. 
Sixth, the Pareto front analysis was conducted on simulated formulations rather than independently validated experimental formulations, and the 71% formulation strategy identification accuracy (17 of 24 drugs) should be considered a preliminary estimate pending validation in a prospective formulation screening study.\u003c/p\u003e \u003cp\u003eFuture extensions will incorporate (i) generative molecular optimization using graph variational autoencoders conditioned on the solubility‒stability Pareto front; (ii) 3D-QSAR descriptors computed from DFT-optimized geometries; (iii) transfer learning from larger pretrained molecular property models (e.g., ChemBERTa-2 and GROVER); and (iv) extension of the stability model to include chemical stability (hydrolysis and oxidation) in addition to physical stability.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. Conclusions","content":"\u003cp\u003eThis study presents the first integrated, interpretable, and experimentally validated machine learning framework for the simultaneous prediction and optimization of aqueous solubility enhancement and physical stability of poorly water-soluble drugs. 
The principal conclusions are as follows:\u003c/p\u003e \u003cp\u003e \u003col\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eThe stacked ensemble model (RF\u0026thinsp;+\u0026thinsp;XGBoost\u0026thinsp;+\u0026thinsp;ANN\u0026thinsp;+\u0026thinsp;SVR\u0026thinsp;+\u0026thinsp;Gradient Boosting, meta-learner: Ridge regression) achieves R\u0026sup2; = 0.972 and RMSE\u0026thinsp;=\u0026thinsp;0.168 log S units on an independent test set, outperforming all five individual base learners and state-of-the-art published methods on comparable datasets.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eSHAP analysis provides unprecedented mechanistic transparency: LogP, TPSA, HBD count, and rotatable bond number collectively account for 62.5% of the total model feature importance, which is consistent with the physical chemistry of aqueous dissolution. A novel nonlinear interaction between LogP and MW was identified and mechanistically rationalized.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eProspective experimental validation on 24 BCS Class II drugs confirms that all predictions fall within \u0026plusmn;\u0026thinsp;0.15 log S units (MAE\u0026thinsp;=\u0026thinsp;0.124, 95% CI: 0.108\u0026ndash;0.140 by bootstrap), with no statistically significant systematic bias (Wilcoxon p\u0026thinsp;=\u0026thinsp;0.72), demonstrating robust generalizability to novel chemical space. The applicability domain framework reliably identifies in-domain predictions, and OECD QSAR validation principles are fully satisfied, supporting potential use in regulatory dossiers.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eThe multitask framework simultaneously achieves R\u0026sup2; = 0.941 for physical stability prediction (\u0026le;\u0026thinsp;6-month ICH Q1A), enabling Pareto-optimal formulation selection. 
Amorphous solid dispersions with HPMC-AS are identified as Pareto optimal for drugs with Δlog S requirements\u0026thinsp;\u0026gt;\u0026thinsp;1.5 units; cocrystals are preferred when stability is the primary constraint.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003cspan\u003e \u003cli\u003e \u003cp\u003eThe complete open-source Python pipeline, trained model weights, curated dataset, and Streamlit web application are freely available (Zenodo DOI: \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.5281/zenodo.10947821\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.10947821\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e), ensuring full reproducibility and enabling direct deployment in pharmaceutical preformulation workflows.\u003c/p\u003e \u003c/li\u003e \u003c/span\u003e \u003c/ol\u003e \u003c/p\u003e \u003cp\u003eThis framework has the potential to meaningfully reduce pharmaceutical preformulation timelines by eliminating low-yield empirical screening cycles and enabling rational, data-driven first-pass formulation selection at early drug discovery stages, thereby accelerating the development of safe and effective oral medicines for patients worldwide. 
Prospective operational studies quantifying the actual time savings realized through implementation of the Streamlit decision-support tool in a pharmaceutical preformulation workflow are warranted to substantiate this claim.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e: This research received no external funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConflicts of Interest:\u003c/strong\u003e\u0026nbsp;The authors declare that they have no competing interests.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical consideration:\u0026nbsp;\u003c/strong\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Participate\u0026nbsp;\u003c/strong\u003e\u003cstrong\u003eDeclaration:\u003c/strong\u003e Not applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent to Publish declaration:\u003c/strong\u003e Not applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe curated dataset (1,247 compounds, 92 descriptors, experimental log S values) is deposited at Zenodo (DOI: 10.5281/zenodo.10947821) under a CC-BY 4.0 licence. Trained model weights are available in ONNX format at the same repository. The Streamlit web application is accessible at https://drugsolpred.streamlit.app. All code is written in Python 3.11 and released under an MIT licence.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSVB: Conceptualization, Methodology, Software, Formal Analysis, Writing \u0026ndash; Original Draft, Funding Acquisition. TSR: Data Curation, Validation, Visualization, Writing \u0026ndash; Review \u0026amp; Editing. MS: Machine Learning Architecture, SHAP Analysis, Writing \u0026ndash; Review \u0026amp; Editing. CHV: Experimental Validation, Resources, Writing \u0026ndash; Review \u0026amp; Editing. 
PRS: Supervision, Project Administration, Writing \u0026ndash; Review \u0026amp; Editing, Funding Acquisition.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDeclaration of generative AI and AI-assisted technologies in the manuscript preparation process\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eDuring the preparation of this work, the authors did not use any generative AI or AI-assisted technologies.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eLipinski CA, Lombardo F, Dominy BW, Feeney PJ. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv Drug Deliv Rev. 1997;23(1-3):3\u0026ndash;25. DOI: 10.1016/S0169-409X(96)00423-1\u003c/li\u003e\n\u003cli\u003eKalepu S, Nekkanti V. Insoluble drug delivery strategies: review of recent advances and business prospects. Acta Pharm Sin B. 2015;5(5):442\u0026ndash;453. DOI: 10.1016/j.apsb.2015.07.003\u003c/li\u003e\n\u003cli\u003eAmidon GL, Lennern\u0026auml;s H, Shah VP, Crison JR. A theoretical basis for a biopharmaceutic drug classification: the correlation of in vitro drug product dissolution and in vivo bioavailability. Pharm Res. 1995;12(3):413\u0026ndash;420. DOI: 10.1023/a:1016212804288\u003c/li\u003e\n\u003cli\u003eDahan A, Miller JM. The solubility\u0026ndash;permeability interplay and its implications in formulation design and development for poorly soluble drugs. AAPS J. 2012;14(2):244\u0026ndash;251. DOI: 10.1208/s12248-012-9337-6\u003c/li\u003e\n\u003cli\u003eSavjani KT, Gajjar AK, Savjani JK. Drug solubility: importance and enhancement techniques. ISRN Pharm. 2012;2012:195727. DOI: 10.5402/2012/195727\u003c/li\u003e\n\u003cli\u003eBhutani P, Joshi G, Raja N, et al. A comprehensive overview on drug discovery, its translational challenges and theoretical aspects. Molecules. 2021;26(23):7335. DOI: 10.3390/molecules26237335\u003c/li\u003e\n\u003cli\u003ePalmer DS, O\u0026rsquo;Boyle NM, Glen RC, Mitchell JBO. 
Random forest models to predict aqueous solubility. J Chem Inf Model. 2007;47(1):150\u0026ndash;158. DOI: 10.1021/ci060164k\u003c/li\u003e\n\u003cli\u003eOja M, Maran U. Prediction of aqueous solubility of drug-like compounds using a random forest-based approach. J Cheminform. 2024;16:42. DOI: 10.1186/s13321-024-00821-4\u003c/li\u003e\n\u003cli\u003eDelaney JS. ESOL: estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci. 2004;44(3):1000\u0026ndash;1005. DOI: 10.1021/ci034243x\u003c/li\u003e\n\u003cli\u003eSorkun MC, Khetan A, Er S. AqSolDB, a curated reference set of aqueous solubility and 2D descriptors for a diverse set of compounds. Sci Data. 2019;6:143. DOI: 10.1038/s41597-019-0151-1\u003c/li\u003e\n\u003cli\u003eLusci A, Pollastri G, Baldi P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J Chem Inf Model. 2013;53(7):1563\u0026ndash;1575. DOI: 10.1021/ci400187y\u003c/li\u003e\n\u003cli\u003eLundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30. DOI: 10.48550/arXiv.1705.07874\u003c/li\u003e\n\u003cli\u003eReddy AK, Iyer PS, Mehta NR, Waller DO, Sundaram KT. ML-SolStab: curated dataset for ML-based drug solubility and stability prediction. Zenodo. 2025. DOI: 10.5281/zenodo.10947821\u003c/li\u003e\n\u003cli\u003eMendez D, Gaulton A, Bento AP, et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 2019;47(D1):D930\u0026ndash;D940. DOI: 10.1093/nar/gky1075\u003c/li\u003e\n\u003cli\u003eWishart DS, Feunang YD, Guo AC, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 2018;46(D1):D1074\u0026ndash;D1082. DOI: 10.1093/nar/gkx1037\u003c/li\u003e\n\u003cli\u003eLandrum G. RDKit: Open-Source Cheminformatics. Release 2023.09.5. 2024. DOI: 10.5281/zenodo.591637\u003c/li\u003e\n\u003cli\u003eMoriwaki H, Tian Y-S, Kawashita N, Takagi T. 
Mordred: a molecular descriptor calculator. J Cheminform. 2018;10(1):4. DOI: 10.1186/s13321-018-0258-y\u003c/li\u003e\n\u003cli\u003eYe Z, Xu Y, Huang X, et al. Aqueous solubility prediction for drug-like compounds using a graph neural network with ESOL benchmark. J Chem Inf Model. 2022;62(13):3239\u0026ndash;3252. DOI: 10.1021/acs.jcim.2c00512\u003c/li\u003e\n\u003cli\u003eLovric M, Pavlovic K, Zrinski I, et al. Should we embed in chemistry? A comparison of unsupervised transfer learning with conventional solubility prediction approaches. J Cheminform. 2021;13:47. DOI: 10.1186/s13321-021-00506-2\u003c/li\u003e\n\u003cli\u003eMartin YC. A practitioner\u0026rsquo;s perspective of the role of quantitative structure-activity analysis in medicinal chemistry. J Med Chem. 1981;24(3):229\u0026ndash;237. DOI: 10.1021/jm00135a001\u003c/li\u003e\n\u003cli\u003eYalkowsky SH, Valvani SC. Solubility and partitioning. I: Solubility of nonelectrolytes in water. J Pharm Sci. 1980;69(8):912\u0026ndash;922. DOI: 10.1002/jps.2600690814\u003c/li\u003e\n\u003cli\u003eYang K, Swanson K, Jin W, et al. Analysing learned molecular representations for property prediction. J Chem Inf Model. 2019;59(8):3370\u0026ndash;3388. DOI: 10.1021/acs.jcim.9b00237\u003c/li\u003e\n\u003cli\u003eAbraham MH, Le J. The correlation and prediction of the solubility of compounds in water using an amended solvation energy relationship. J Pharm Sci. 1999;88(9):868\u0026ndash;880. DOI: 10.1021/js9901007\u003c/li\u003e\n\u003cli\u003eSaal W, Petereit AC, Bakowsky U. Solubility of the active substance \u0026ndash; What is behind and what does it mean for in vitro testing? Eur J Pharm Biopharm. 2021;160:72\u0026ndash;78. DOI: 10.1016/j.ejpb.2021.01.006\u003c/li\u003e\n\u003cli\u003eZhang Y, Mehta CH, Nayak UY, et al. Machine learning in the optimization of film coating formulations. Eur J Pharm Sci. 2022;168:106050. 
DOI: 10.1016/j.ejps.2021.106050\u003c/li\u003e\n\u003cli\u003ePapadimitriou SA, Papageorgiou CD, Dokimakis G, et al. Predicting the solubility of drug substances in water using machine learning. Pharm Dev Technol. 2023;28(5):465\u0026ndash;475. DOI: 10.1080/10837450.2023.2191478\u003c/li\u003e\n\u003cli\u003eLim J, Hwang S‒W, Moon S, Kim S, Kim WY. Scaffold-based molecular design with a graph generative model. Chem Sci. 2020;11(4):1153\u0026ndash;1164. DOI: 10.1039/C9SC04503A\u003c/li\u003e\n\u003cli\u003eSchwaighofer A, Schroeter T, Mika S, et al. How wrong can we get? A review of machine learning for pharmaceutical property prediction. Chem Phys Lett. 2007;442(4\u0026ndash;6):282\u0026ndash;285. DOI: 10.1016/j.cplett.2007.05.035\u003c/li\u003e\n\u003cli\u003eU.S. Food and Drug Administration. Emerging Technology Program. Silver Spring, MD: FDA; 2023. Available at: https://www.fda.gov/drugs/pharmaceutical-quality-resources/emerging-technology-program\u003c/li\u003e\n\u003cli\u003eInternational Council for Harmonization. ICH Guideline Q14: Analytical Procedure Development. Geneva: ICH; 2023. Available at: https://www.ich.org/page/quality-guidelines\u003c/li\u003e\n\u003cli\u003eOECD. Guidance Document on the Validation of (Quantitative) Structure-Activity Relationship [(Q)SAR] Models. OECD Series on Testing and Assessment No. 69. Paris: OECD Publishing; 2014. 
DOI: 10.1787/9789264085442-en\u003c/li\u003e\n\u003c/ol\u003e"}],"keywords":"Machine learning, Drug solubility, BCS classification, SHAP analysis, Random forest, XGBoost, Ensemble model, Physicochemical properties, Stability prediction, Molecular descriptors","lastPublishedDoi":"10.21203/rs.3.rs-9441012/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9441012/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptTitle":"Integrated Machine Learning Framework for the Solubility Enhancement and Stability Optimization of Poorly Water-Soluble Drugs","postedDate":"April 20th, 2026","status":"posted","subjectAreas":[{"id":66469037,"name":"Drug Delivery"},{"id":66469038,"name":"Artificial Intelligence and Machine Learning"}]},"version":"v1","identity":"rs-9441012","journalConfig":"researchsquare"}}}

