Deep Learning for Green Chemistry: An AI-Enabled Pathway for Biodegradability Prediction and Organic Material Discovery

doi:10.21203/rs.3.rs-4002218/v1

2024 · doi:10.21203/rs.3.rs-4002218/v1

preprint OA: closed CC-BY-4.0

Research Article

Dela Quarme Gbadago, Gyuyeong Hwang, Kihwan Lee, Sungwon Hwang

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4002218/v1. This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 12 Jun, 2024 in Korean Journal of Chemical Engineering. Version 1.

Abstract

The increasing global demand for eco-friendly products is driving innovation in sustainable chemical synthesis, particularly the development of biodegradable substances.
Herein, a novel method utilizing artificial intelligence (AI) to predict the biodegradability of organic compounds is presented, overcoming the limitations of traditional prediction methods that rely on laborious and costly density functional theory (DFT) calculations. We propose leveraging readily available molecular formulas and structures, represented by simplified molecular-input line-entry system (SMILES) notation and molecular images, to develop an effective AI-based prediction model using state-of-the-art machine learning techniques, including deep convolutional neural networks (CNN) and long short-term memory (LSTM) learning algorithms, capable of extracting meaningful molecular features and spatiotemporal relationships. The model is further enhanced with reinforcement learning (RL) to better predict and discover new biodegradable materials by rewarding the system for identifying unique and biodegradable compounds. The combined CNN-LSTM model achieved an 87.2% prediction accuracy, outperforming CNN-only (75.4%) and LSTM-only (79.3%) models. The RL-assisted generator model produced approximately 60% valid SMILES structures, with over 80% being unique to the training dataset, demonstrating the model's capability to generate novel compounds with potential for practical application in sustainable chemistry. The model was extended to develop novel electrolytes with desired molecular weight distribution.

Keywords: Biodegradability, SMILES, Green chemistry

1. Introduction

Throughout scientific development, humanity has produced an abundance of organic compounds, many of which are utilized once and then discarded. The yearly production of plastic has reached an astonishing 450 million tons, with 340 million tons being generated as waste [1].
Regrettably, these organic compounds exhibit remarkable resistance to natural decomposition, leading to their persistence in the environment and posing significant threats to human well-being and ecosystems [2]. Consequently, assessing the biodegradability of organic compounds has been increasingly regarded as crucial in recent times. Following the European Registration, Evaluation, Authorization, and Restriction of Chemicals (REACH) regulation, companies engaged in the manufacturing or importing of chemicals exceeding 1 ton per year are mandated to provide detailed information regarding the biodegradability of their compounds [3]. To evaluate biodegradability, standardized test methods published by prestigious organizations such as the Organization for Economic Co-operation and Development (OECD) [4] and Japan's Ministry of International Trade and Industry (MITI) [5] are primarily employed.

In addition to assessing the biodegradability of existing compounds, the significance of discovering novel biodegradable organic compounds is also growing. However, searching for potential candidates within the entire compound space is nearly impossible due to its vast scale, estimated to range from $10^{23}$ to $10^{60}$. Predicting new molecules through calculations, synthesizing them, and testing their physical properties is time-consuming. As a result, only approximately $10^{8}$ compounds have been synthesized thus far [6]. Utilizing generative models for discovering new molecules alleviates these challenges. Unlike conventional methods, generative models operate through inverse modeling. This means that new molecules are generated based on desired properties, offering a more efficient approach to exploration. Different methods have been devised to enable the incorporation of complex molecular structures into neural networks.
One prevalent approach is using Simplified Molecular Input Line Entry System (SMILES)[ 7 ], which converts molecules into a one-dimensional text array following a specific set of rules. Due to its effectiveness, SMILES is widely employed in many molecular generation models. Recently, the use of generative models for chemical substance discovery has been actively researched[ 8 ]. Early generative models were developed by combining recurrent neural networks (RNNs) and reinforcement learning[ 9 ]. However, to overcome the limitations of these models, various types of generative models have been developed. Chiu et al.[ 10 ] proposed a method for predicting the hydrolysis rate by utilizing not only the SMILES representation but also the partial charge of the molecule as inputs to the autoencoder. Wang et al.[ 11 ] addressed the challenge of balancing desirable properties and novelty in molecular design. They developed a model that interprets the ligand-receptor structure by taking the molecular 3D structure as an input. J Arús-Pous et al.[ 12 ] divided the existing dataset into subsets with desired molecular scaffolds to devise a strategy to create molecules with specific characteristics without using reinforcement learning. Cao et al.[ 13 ] conducted research on avoiding the computationally expensive likelihood-matching process. They used generative adversarial networks (GANs) with graphs as inputs. Tang et al.[ 14 ] employed a Support Vector Machine (SVM) classifier to enhance the prediction accuracy and overcome the limitations of linear regression when predicting the biodegradability of large molecules. Dollar et al.[ 15 ] attempted to introduce the attention mechanism, commonly used in translation tasks, into variational autoencoders (VAE) for de novo molecular design. While several studies have been conducted in this area, there is a notable lack of research on generative models for discovering biodegradable organic compounds. 
The main challenge lies in training a model, because biodegradability databases are severely insufficient. In contrast to properties such as LogP, which are abundant and can be computed easily with tools like RDKit, biodegradability data remain scarce. In response to this issue, Lunghini et al. [16] constructed a substantial database by integrating various biodegradability data. Additionally, given the complex mechanisms determining the biodegradation rate, numerous models employing the Quantitative Structure-Activity Relationship (QSAR) method are being explored to classify compounds into biodegradable and non-biodegradable substances [17, 18]. However, these models are imperfect, mainly due to their limited applicability scope. Furthermore, as in the previous examples, much research has focused on enhancing performance by altering the generative model, whereas only a limited body of research is dedicated to improving the prediction model. Particularly in the case of biodegradability, accessing sufficient databases for training remains challenging, and a well-defined mathematical and quantitative method for determining the biodegradability of newly synthesized molecules has yet to be established. Given these constraints, a viable approach for biodegradability prediction involves enabling the neural network to learn molecular features.

Therefore, in this study, we introduce an integrated methodology for biodegradability prediction and material discovery. The approach combines deep learning techniques, generative models, and reinforcement learning to address the complex task of efficiently identifying novel biodegradable organic compounds. Our research establishes a robust data preparation pipeline, utilizing SMILES notations for versatile compound representation and employing data augmentation techniques to enhance dataset diversity.
The proposed prediction model adopts a hybrid architecture, leveraging long short-term memory (LSTM) networks and convolutional neural networks (CNNs), effectively handling sequential data and spatial patterns to provide highly accurate biodegradability predictions. By adopting a stack augmented RNN for molecular trajectory generation within a reinforcement learning framework, our generator model empowers the exploration of intricate chemical spaces, facilitating the discovery of environmentally friendly materials. Furthermore, our research incorporates a reward mechanism that quantifies the value of molecular structures based on biodegradability, thus ensuring the alignment of the learning process with environmentally conscious objectives. We also employ a systematic grid search for hyperparameter optimization, guaranteeing that model configurations are finely tuned for optimal predictive accuracy.

The rest of the study is structured as follows. Section 2 describes the algorithms and procedures implemented in this work. Section 3 presents the simulation results, comparative analysis, and discussion of findings. The study is concluded in Section 4, wherein an overview of the contributions of this study and its applications are presented.

2. Methodology

This section comprehensively describes the solution strategies and algorithms adopted in executing the study. The data processing methods, prediction models, optimization steps, and generator models are discussed.

2.1 Data preparation and processing

The rapid advancement of computing has opened new avenues for predicting and exploring the biodegradability of organic compounds. Existing methods often require laborious and computationally expensive DFT calculations, hindering their scalability and efficiency. This research aims to develop an AI-driven model that leverages molecular formulas and structures for efficient biodegradability prediction.
To represent a large number of compounds, we employ SMILES, a simple and independent nomenclature that is easy for computers to process. This notation allows us to effectively encode and process the chemical structures of compounds in the AI models. SMILES is flexible in representing molecules, specifying the connectivity of atoms through their bonds. Different starting atoms or bond traversal orders result in distinct SMILES strings, enabling multiple valid representations for the same compound. The SMILES strings are also converted into structural images for subsequent training. Examples of compounds, their respective SMILES notations, and structural images are depicted in Fig. 1.

A diverse dataset of 1055 organic compounds, comprising readily biodegradable (RB) materials (355 species) and non-readily biodegradable materials (700 species), was obtained from Kamel et al. [19], where a detailed description of the data is available. The dataset was shuffled to ensure a random distribution and subsequently divided into specific segments for training and validation. The SMILES strings were converted into canonical forms, ensuring a standardized representation of each chemical compound. Additionally, random permutations of atomic indices were generated to augment the dataset, providing diverse representations of the same chemical structures. A tokenization procedure was applied to the SMILES strings to separate them into individual atomic symbols and other special characters. The set of unique tokens obtained was mapped to corresponding indices, creating a consistent format for subsequent training. The length of the tokenized SMILES strings in the dataset was evaluated, and the maximum length was determined, allowing for the consistent handling of SMILES strings of varying lengths.
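The tokenization, token-to-index mapping, and length handling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the regular expression is a commonly used pattern for the organic subset of SMILES, and the names `tokenize`, `build_vocab`, `encode`, and the `<pad>` token are assumptions introduced here.

```python
import re

# Assumed token pattern: bracket atoms, two-letter halogens, two-digit ring
# closures, single-letter atoms, ring digits, then bond/branch symbols.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\(\)/\\\.@:~\*\$]")


def tokenize(smiles):
    """Split a SMILES string into atomic symbols and special characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def build_vocab(smiles_list, pad="<pad>"):
    """Map every unique token to an index; index 0 is reserved for padding."""
    unique = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate([pad] + unique)}


def encode(smiles, vocab, max_len, pad="<pad>"):
    """Convert a SMILES string to a fixed-length list of token indices."""
    ids = [vocab[tok] for tok in tokenize(smiles)]
    return ids + [vocab[pad]] * (max_len - len(ids))


if __name__ == "__main__":
    data = ["CC(=O)Cl", "c1ccccc1"]
    vocab = build_vocab(data)
    max_len = max(len(tokenize(s)) for s in data)
    for s in data:
        print(s, encode(s, vocab, max_len))
```

Padding every sequence to the longest tokenized length is one simple way to obtain equally shaped inputs; canonicalization and randomized atom ordering would be applied beforehand with a cheminformatics toolkit.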
The dataset was further processed to generate input-output pairs suitable for training LSTM networks, involving randomizing the SMILES strings and converting them into a tensor format. A conversion process was implemented to transform characters or strings into corresponding tensor formats. This facilitated the handling of data within the deep learning framework. Training the model with different SMILES representations and images of the same compound at each iteration can enhance the model’s generalizability as the dataset increases. This approach allows the model to learn diverse representations of the same compound, capturing various aspects of its chemical structure and visual characteristics. The training process benefits from the increased variability in the data, enabling the model to better generalize and make accurate predictions on unseen compounds. This technique promotes robustness and adaptability in the model's learning process, ultimately improving its performance in biodegradability prediction and material discovery.

2.2 Prediction Model Building

In this study, a hybrid approach leveraging two distinct deep learning architectures, namely LSTM networks and CNN, was developed to tackle the predictive task encompassing the analysis of chemical structures. LSTM networks are efficient at processing time series and textual data, which are essential in extracting features in organic compounds. They excel in recognizing long-term dependencies and patterns within sequential data, such as chemical structures and physical properties, which are crucial for predicting biodegradability. LSTM's ability to retain and utilize historical information allows for accurate biodegradability predictions by learning from molecular descriptors and their effects over time. CNNs are effective in biodegradability prediction by extracting features from image data of chemical structures.
Training on these structures and their biodegradability labels, CNNs identify local patterns and spatial relationships key to assessing biodegradation potential. Convolutional layers use filters to capture significant features at different scales, enabling CNNs to forecast the biodegradability of previously unseen compounds with enhanced precision. The combined architecture synthesizes the inherent strengths of both LSTM and CNN models, facilitating the interpretation of complex patterns within data represented through both sequences and images.

The LSTM component, constructed as a two-layer model accepting inputs of dimension 165, was employed for its ability to handle sequential data, reflecting the sequential nature of chemical information in SMILES strings. An embedding layer was incorporated with an output dimension of 12, effectively reducing dimensionality and capturing semantic relationships, represented by:

$$e(x)=W_e \cdot x+b_e \tag{1}$$

where $x$ represents the input, $W_e$ represents the embedding matrix, and $b_e$ represents the bias. The LSTM layer, consisting of 256 units, provides the network's memory function, capturing long-term dependencies and patterns over time, making it highly relevant for analyzing the chemical structure of organic compounds and their biodegradability. The layer can be mathematically represented as [20–22]:

$$\begin{aligned} f_t &= \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{2}$$

where $f_t$, $i_t$, and $o_t$ are the forget, input, and output gates, $c_t$ is the cell state, $h_t$ is the hidden state, $\sigma$ is the sigmoid activation function, and $\odot$ represents elementwise multiplication.
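To make Eq. (2) concrete, the following is a minimal pure-Python sketch of a single LSTM time step. It is only an illustration of the gate equations; the helper names (`affine`, `lstm_step`) and the dictionary layout of the parameters are assumptions made here, and a real model would use a deep learning framework with 256 units rather than hand-written lists.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(W, b, v):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eq. (2); p holds W_f, b_f, W_i, b_i, W_o, b_o, W_c, b_c."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], z)]    # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], z)]    # input gate
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], z)]    # output gate
    g = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], z)]  # candidate cell update
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


if __name__ == "__main__":
    # Toy parameters: hidden size 1, input size 1, all weights and biases zero.
    zeros = {k: [[0.0, 0.0]] for k in ("W_f", "W_i", "W_o", "W_c")}
    zeros.update({k: [0.0] for k in ("b_f", "b_i", "b_o", "b_c")})
    h, c = lstm_step([2.0], [0.0], [1.0], zeros)
    print(h, c)
```

With all-zero parameters every gate equals 0.5, so the cell state is simply halved at each step, which is a convenient sanity check on the update rule.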
Subsequent layers included a dropout layer with a rate of 0.3, to prevent overfitting, and a dense layer with 35 units employing a hyperbolic tangent activation function and He normal initialization, enhancing the network's ability to capture non-linear relationships. All of these settings were obtained by grid search optimization, as described in Section 2.4. The output of this dense layer is defined as:

$$y=\tanh(W_d \cdot h+b_d) \tag{3}$$

where $y$ is the output vector, $W_d$ is the weight matrix of the dense layer, $h$ is the input vector from the previous layer, and $b_d$ is the bias vector added to the weighted sum before applying the activation function.

Conversely, the CNN model was adopted for its effectiveness in analyzing spatial patterns within images, pertinent to the 300×300 images with three channels used in this study. The model begins with a Conv2D layer composed of 6 filters of size 3×3 and strides of 4×4, represented as:

$$Y_{ij}=\sum_{m,n} X_{i+m,j+n} \cdot K_{mn}+b \tag{4}$$

where $Y_{ij}$ is the output feature map at position $(i,j)$, $X_{i+m,j+n}$ are input values at relative positions, $K_{mn}$ are convolutional filter weights, and $b$ is the bias term. This layer is followed by batch normalization and ReLU activation to accelerate training and introduce non-linearity:

$$\begin{aligned} Y_{normalized} &= \frac{Y-\mu}{\sqrt{\sigma^2+\epsilon}} \\ Y_{ReLU} &= \max(0, Y_{normalized}) \end{aligned} \tag{5}$$

Subsequent max-pooling layers reduced dimensionality and emphasized salient features, while the sequence concluded with a flattening step, a dropout layer with a rate of 0.3, and a dense layer of 50 neurons with ReLU activation [23, 24] and He normal initialization, further contributing to robust feature extraction. These hyperparameters were also obtained via the grid search optimization.
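The convolution, normalization, and ReLU steps of Eqs. (4) and (5) can be illustrated with a small single-channel sketch. Several simplifications are assumed here and are not stated in the text: no padding ("valid" convolution), one filter, and normalization statistics taken over a single feature map rather than over a batch. The function names are hypothetical.

```python
import math


def conv_output_size(n, k, stride):
    """Output length of an unpadded convolution along one axis."""
    return (n - k) // stride + 1


def conv2d_valid(X, K, b=0.0, stride=1):
    """Eq. (4) with a stride: Y[i][j] = sum_mn X[i*s+m][j*s+n] * K[m][n] + b."""
    kh, kw = len(K), len(K[0])
    out_h = conv_output_size(len(X), kh, stride)
    out_w = conv_output_size(len(X[0]), kw, stride)
    return [
        [
            sum(X[i * stride + m][j * stride + n] * K[m][n]
                for m in range(kh) for n in range(kw)) + b
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def batch_norm_relu(Y, eps=1e-5):
    """Eq. (5): normalize with the map's mean and variance, then apply ReLU."""
    flat = [v for row in Y for v in row]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    return [[max(0.0, (v - mu) / math.sqrt(var + eps)) for v in row] for row in Y]


if __name__ == "__main__":
    # A 300-pixel axis with a 3-wide kernel and stride 4, assuming no padding.
    print(conv_output_size(300, 3, 4))  # 75
    X = [[float(r + c) for c in range(5)] for r in range(5)]
    K = [[1.0] * 3 for _ in range(3)]
    print(batch_norm_relu(conv2d_valid(X, K, stride=2)))
```

Under the no-padding assumption, the stride of 4 reduces each 300-pixel axis to 75 positions, which shows how aggressively the first layer downsamples the molecular image.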
The outputs from the LSTM and CNN models were concatenated, capitalizing on their synergistic strengths, followed by two dense layers with 40 and 2 units, respectively. The latter employed a SoftMax activation function [25], enabling probabilistic interpretation of the model's predictions:

$$y=\mathrm{SoftMax}(W \cdot [h_{LSTM}, h_{CNN}]+b) \tag{6}$$

The combined model was compiled with the Adam optimizer [26–28] at a learning rate of 0.0001 and categorical cross-entropy loss function [29, 30], optimizing for multi-class classification performance:

$$L=-\frac{1}{N}\sum_{i=1}^{N} y_i \log(\hat{y}_i) \tag{7}$$

where $y_i$ denotes the true label and $\hat{y}_i$ denotes the predicted label for each sample in the batch of size $N$. Model parameters were saved and loaded from the disk, enhancing reproducibility, and allowing for further utility.

For training, an iterative tokenization procedure was applied to the training and validation datasets across a sequence of times, aligning with the sequential nature of the data. The combined model was fit for 2000 epochs with a batch size of 10, balancing the trade-off between computational efficiency and convergence stability. Following training, the model underwent evaluation on a test dataset, and various functionalities were deployed, including saving, loading best models, and executing predictions with the optimally performing model. Additionally, a series of utility functions were employed to perform crucial tasks such as validation of SMILES strings, generation, and canonical conversion of specific SMILES strings, pairwise similarity computation, prediction using generated SMILES strings, simple moving average calculation, reward calculation, and similarity and canonical checks on generated strings.
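Equations (6) and (7) can be sketched in a few lines of plain Python. The snippet concatenates two feature vectors, applies a linear layer with softmax, and evaluates the categorical cross-entropy on one-hot labels. The function names, the toy feature vectors, and the small `eps` guard are assumptions added for illustration, and the intermediate 40-unit dense layer is omitted for brevity.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fused_prediction(h_lstm, h_cnn, W, b):
    """Eq. (6): softmax(W . [h_LSTM, h_CNN] + b)."""
    h = h_lstm + h_cnn  # concatenate the two feature vectors
    logits = [sum(w * x for w, x in zip(row, h)) + bias for row, bias in zip(W, b)]
    return softmax(logits)


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (7) with one-hot labels: mean over samples of -sum_c y_c * log(y_hat_c)."""
    total = 0.0
    for yt, yp in zip(y_true, y_pred):
        total -= sum(t * math.log(p + eps) for t, p in zip(yt, yp))
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical 2-dim LSTM and CNN features and a 2-class output layer.
    W = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]
    b = [0.0, 0.0]
    probs = fused_prediction([1.0, 0.5], [0.2, -0.3], W, b)
    print(probs, categorical_cross_entropy([[1, 0]], [probs]))
```

The two softmax outputs sum to one, so they can be read as the predicted probabilities of the biodegradable and non-biodegradable classes.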
These utility functions not only enriched the model's interpretive capability but also facilitated a more nuanced assessment and interpretation of predictions concerning biodegradable and non-biodegradable chemical structures. Collectively, the integrated methodology provided a robust framework for predictive analysis, merging sequence understanding with spatial pattern recognition and supporting comprehensive validation and interpretive analysis.

2.3 Generator model building

Developing novel biodegradable materials is crucial in modern materials science, contributing to sustainable development and environmental protection. In this research, a methodology is constructed leveraging reinforcement learning (RL), uniquely suited to this task due to its ability to explore and optimize complex, high-dimensional spaces. The RL model consists of three primary components: the generator, predictor, and reward function, each with distinct implications.

(1) Generator: Utilized for generating molecular trajectories, the generator, adopted from Popova et al. [31, 32], is the core of the explorative aspect of the RL framework. It symbolizes the ability to propose new molecular structures in the search space, allowing the discovery of potentially novel biodegradable materials. The generator model is a stack-augmented RNN developed using PyTorch. It consists of an Embedding layer to translate the input x into continuous space, e(x), facilitating the nuanced processing of molecular structures and understanding complex relations within the molecules.
The gated recurrent unit (GRU) is employed to enhance the handling of sequential data and SMILES representations, which is vital for capturing temporal dependencies in molecular design. Its reset and update gates are governed by:

$$\begin{aligned} r_t &= \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\ z_t &= \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \end{aligned} \tag{8}$$

and its hidden state by:

$$h_t=(1-z_t) \odot h_{t-1}+z_t \odot \tilde{h}_t \tag{9}$$

where $\sigma$ is the sigmoid function, $r_t$ and $z_t$ denote the reset and update gates at time $t$, $h_{t-1}$ is the previous hidden state, $\tilde{h}_t$ is the candidate hidden state, $x_t$ is the current input, and $W_r$, $W_z$, $b_r$, and $b_z$ are the weight matrices and bias terms. An innovative feature of this model is the stack augmentation mechanism, which is central to generating diverse and complex molecular trajectories. The stack operation equations, governed by push, pop, and no-op controls, enable flexible and intelligent manipulation of the stack structure. The decoder, coupled with LogSoftmax activation, translates the GRU's output and ensures normalization, fundamental for accurate prediction and selection of the next molecular character:

$$y_t=\mathrm{LogSoftmax}(W_o \cdot h_t+b_o) \tag{10}$$

where $W_o$ and $b_o$ are the weight matrix and bias term of the output layer, and $y_t$ is the predicted output at time $t$. The training and evaluation functions encapsulate the learning process, which is essential for adapting the model to generate desired molecular structures. The loss is computed using the Cross-Entropy Loss function. Additionally, various utilities, including changing the learning rate and handling stack operations, enhance the flexibility and efficiency of the model.

(2) Predictor: Previously described in Section 2.2, this component evaluates the generated trajectories, functioning as the evaluative mechanism within the RL environment.
It serves as the scientific bridge between the mathematical formulations of RL and the physical properties of molecules, providing tangible feedback based on generated molecular structures.

(3) Reward Function: Computing the reward based on the generated sequence of molecular structures, the reward function plays a critical role in guiding the learning process. By quantifying the value of each structure in terms of biodegradability, it ensures that the learning process aligns with the ultimate scientific goal of the research. Herein, a high reward is assigned if the generated material is not in the training data and is biodegradable. This allows the weights of the generator model to be updated. The reward is expressed as $R(s, a)$, where $s$ denotes the state and $a$ denotes the action taken.

The policy gradient method is applied, vital for continuous, high-dimensional action spaces common in molecular design. This method maximizes the expected cumulative reward [9, 31, 33], emphasizing the trajectories that lead to the most promising materials, according to the following equation:

$$\nabla_\theta J(\theta)=\mathbb{E}_{\pi_\theta}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi}(s,a)\right] \tag{11}$$

where $\theta$ is the policy parameter, $\pi$ represents the policy, and $Q^{\pi}$ is the action-value function. Using gradient clipping ensures stable and robust convergence by avoiding the exploding gradient problem. The clipped gradient [33] can be represented as:

$$\nabla_{clipped}=\min\left(\nabla, \frac{\nabla}{\parallel \nabla \parallel} \times threshold\right) \tag{12}$$

The iterative process, involving policy replay and updates, illustrates RL's dynamism.
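The reward rule and the gradient clipping of Eq. (12) can be sketched as below. The reward magnitudes (`high`, `low`, `invalid`) are hypothetical placeholders, since the text only states that novel, biodegradable compounds receive a high reward. Eq. (12) is read here in the usual norm-clipping sense: the gradient is rescaled to the threshold only when its norm exceeds it.

```python
import math


def biodegradability_reward(smiles, is_valid, is_biodegradable, training_set,
                            high=10.0, low=1.0, invalid=0.0):
    """High reward only for valid, biodegradable compounds absent from the training data.

    The numeric reward levels are illustrative assumptions.
    """
    if not is_valid:
        return invalid
    if is_biodegradable and smiles not in training_set:
        return high
    return low


def clip_by_norm(grad, threshold):
    """Rescale grad so that its Euclidean norm never exceeds threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold or norm == 0.0:
        return list(grad)
    scale = threshold / norm
    return [g * scale for g in grad]


if __name__ == "__main__":
    train = {"CCO"}
    print(biodegradability_reward("CCN", True, True, train))   # novel and biodegradable
    print(biodegradability_reward("CCO", True, True, train))   # already in training data
    print(clip_by_norm([3.0, 4.0], 1.0))                        # norm 5 rescaled to norm 1
```

In a policy-gradient update, the reward would weight the log-probability of the generated sequence, and the clipped gradient would then be passed to the optimizer.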
The update rule can be expressed using the Bellman equation [34–36]:

$$Q(s,a) \leftarrow (1-\alpha)\,Q(s,a)+\alpha\left(r+\gamma \max_{a'} Q(s',a')\right) \tag{13}$$

where $\alpha$ is the learning rate, $\gamma$ is the discount factor, $r$ is the reward, and $s'$ and $a'$ are the next state and action. Furthermore, evaluating the generated SMILES strings' validity and canonicity ensures that the generated molecular structures are not only novel but also chemically accurate and practically feasible. Lastly, converting valid SMILES into canonical form, and the subsequent visualization, encapsulates the synthesis of theoretical findings with practical applications, bridging computational discoveries with real-world chemical representations. The overall schematic representation of the proposed solution strategy is presented in Fig. 2.

2.4 Hyperparameter optimization

The grid search method [37] was employed to enhance the accuracy of the models, reducing the risk of the suboptimal models that the conventional trial-and-error approach to model fine-tuning often yields [23, 24]. The grid search methodology represents a fundamental algorithmic approach for hyperparameter tuning [38]. In essence, we partition the domain of the hyperparameters into a discretized grid. Next, we systematically explore all possible combinations of values within this grid while concurrently evaluating various performance metrics through cross-validation. The grid point that yields the highest average value during cross-validation represents the optimal configuration of hyperparameters. Grid search is a meticulous algorithm that comprehensively explores all possible combinations, thereby enabling the identification of the optimal point within the given domain [39]. Its main limitation is that it is notably slow: performing a comprehensive exploration of all configurations in the search space requires a substantial amount of time.
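The grid search procedure described above can be sketched generically as follows. The scoring callback `fit_and_score`, the fold construction, and the toy scoring function are assumptions for illustration; in practice the callback would train the model on the training indices and return a validation metric.

```python
from itertools import product
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k interleaved folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        train = [j for f_idx, fold in enumerate(folds) if f_idx != i for j in fold]
        yield train, folds[i]


def grid_search(param_grid, fit_and_score, n_samples, k=5):
    """Evaluate every grid point with k-fold cross-validation; return the best one."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = mean(fit_and_score(params, train, val)
                     for train, val in k_fold_indices(n_samples, k))
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy stand-in for model training: the score peaks at 35 units and small learning rates.
    def toy_score(params, train_idx, val_idx):
        return -abs(params["units"] - 35) - 100 * params["lr"]

    grid = {"units": [10, 35, 50], "lr": [0.01, 0.001]}
    print(grid_search(grid, toy_score, n_samples=20, k=5))
```

The cost is the number of grid points multiplied by k training runs, which is why the ranges in a grid are usually kept coarse.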
In addition, each point within the grid requires k-fold cross-validation, a process that entails k training iterations [40]. Thus, optimizing the hyperparameters of a model using this methodology can present significant intricacies and costs. Nevertheless, exploring the synergistic effects of hyperparameters is prudent when pursuing optimal performance, and grid search is a superior approach for this purpose. The range of hyperparameters explored is presented in Table 1.

Table 1. Parameters used in the hyperparameter optimization.

Type                                      Range or candidates
Number of epochs †                        1–10000
Number of neurons in the first layer †    1–500
Number of neurons in the second layer †   1–100
Activation functions †                    Sigmoid, hyperbolic tangent, ReLU, leaky ReLU, ELU, and SELU
Learning rates †                          0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.007, 0.005, 0.003, and 0.001
Loss functions †                          MSE and MAE
Optimizer †                               AdaGrad, Adam, AdaMax, and Nadam

† [41, 42], ‡ [43–45]

3. Results and Discussion

This section presents and discusses the findings from our proposed model and its comparison with previous models. Testing of the model on the electrolyte dataset to establish its generalizability and novel material discovery potential is also presented.

3.1 Model validation

In Fig. 3, we demonstrate the prediction capability of our integrated model on the training and validation datasets and compare it with the CNN-only and LSTM-only models. It is worth noting that these models were optimized using the grid search method presented in Section 2.4. The proposed model exhibited an accuracy of 87.2%, compared to 75.4% and 79.3% for the CNN-only and LSTM-only models, respectively. This is ascribed to the proposed model’s ability to learn both spatial and temporal dependencies in the SMILES data, enhancing its capability to efficiently predict the biodegradability of the organic materials.
The CNN-only model significantly overfits the training data, as revealed by its high training AUC of 0.981 combined with a much lower test AUC of 0.834. The LSTM-only model, on the other hand, outperformed the CNN-only model in terms of generalizability, achieving a test AUC of 0.892, but underperformed the integrated model. The accuracy of the proposed integrated model can be further improved by increasing the number of molecules in the dataset, which is only 1055 in this case.

3.2 Novel Material Discovery

The gated recurrent unit (GRU), which has a similar structure to LSTM, enables faster SMILES generation [46]. The GRU predicts the next character based on the current input character and hidden state. This allows for the generation of new compounds or materials. To train the GRU model, special start and end tokens are added at the beginning and end of each SMILES sequence. This modification ensures that the model learns to recognize the start and end points of the SMILES sequences during training. From our results, approximately 60% of the generated SMILES are valid, indicating that most of them adhere to the structural rules governing the SMILES notation. In addition, more than 80% of these generated SMILES are distinct from the compounds in the training dataset, highlighting their novel characteristics and exploratory potential. Examples of the generated materials are presented in Fig. 4. Nevertheless, only about 40% of the generated SMILES were biodegradable (Fig. 5a). Hence, the biodegradable material generation ability of the model must be enhanced. By leveraging the developed biodegradability prediction model and the GRU-based SMILES generator, we can harness the power of RL to explore and discover novel organic compounds with inherent biodegradable properties.
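The validity and novelty percentages quoted above can be computed with a short routine like the following. This is only a sketch: the validity check is passed in as a callable, and the `balanced_parentheses` function is a toy stand-in used so that the example runs without external packages. In practice one would typically parse each string with a cheminformatics toolkit (for example, RDKit's `Chem.MolFromSmiles` returns `None` for strings it cannot parse) and compare canonical SMILES.

```python
def generation_metrics(generated, training_set, is_valid):
    """Return (fraction of valid strings, fraction of valid strings not in training data)."""
    valid = [s for s in generated if is_valid(s)]
    novel = [s for s in valid if s not in training_set]
    valid_fraction = len(valid) / len(generated) if generated else 0.0
    novel_fraction = len(novel) / len(valid) if valid else 0.0
    return valid_fraction, novel_fraction


def balanced_parentheses(smiles):
    """Toy validity check (assumption): branch parentheses must be balanced."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


if __name__ == "__main__":
    generated = ["CCO", "CC(", "c1ccccc1", "CCO"]
    training = {"CCO"}
    print(generation_metrics(generated, training, balanced_parentheses))
```

Novelty is measured here relative to the valid strings only, mirroring the way the text reports the share of valid SMILES that are distinct from the training compounds.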
In the RL stage, a high reward is assigned if the generated material is not in the training data and is biodegradable. This reward allows the weights of the generator model to be updated. We next present the results of the final generator model integrated with RL. The generative model was successfully trained through RL to discover more biodegradable compounds. Compared to the GRU-based generative model without RL, about 95% of the materials generated by the final model were biodegradable (Fig. 5(a) and (b)), of which 42% are not present in the original training dataset, demonstrating the novel material discovery capability of our model. By incorporating constraints on the similarity between specific functional groups/atoms and the generated compounds, we could generate diverse materials while preserving specific functional group/atom characteristics (Fig. 5c).

Next, we compared our model with the state-of-the-art model proposed by Popova et al. [31] for de novo drug design and obtained superior prediction results in terms of ROC, RMSE, and MAPE (Fig. 6). Notably, our model outperformed the state-of-the-art model, with a training AUC of 0.974 and a testing AUC of 0.916 compared to 0.913 and 0.891, respectively. This result is ascribed to the proposed model's ability to learn both spatial and temporal dependencies in the training dataset using the CNN-LSTM integrated model. It is worth mentioning that the CNN component of the proposed model tends to cause overfitting (a wider gap between the training and test results) and was therefore carefully optimized to yield the expected synergistic effect. In terms of computational cost, both models required nearly the same training time of 2 hours and 20 minutes, indicating that the better accuracy of the proposed model incurs no additional computing burden.

In addition, we tested the capability of the developed model to design novel electrolytes as part of this research, recognizing their significance in today's world.
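The reward rule used in the RL stage of this section (a high reward when a generated molecule is both absent from the training data and biodegradable) can be sketched as follows. The numeric reward values, the 0.5 decision threshold, and the `predict_probability` callback standing in for the CNN-LSTM predictor are all assumptions made for illustration; the source does not state them.

```python
def biodegradability_reward(smiles, training_smiles, predict_probability,
                            threshold=0.5, high_reward=1.0, low_reward=0.0):
    """Reward for one generated SMILES string.

    predict_probability is a hypothetical callable returning the predicted
    probability that the molecule is readily biodegradable. The threshold
    and reward magnitudes are illustrative assumptions.
    """
    is_novel = smiles not in training_smiles
    is_biodegradable = predict_probability(smiles) >= threshold
    return high_reward if (is_novel and is_biodegradable) else low_reward


def mean_batch_reward(batch, training_smiles, predict_probability):
    """Average reward over a batch of generated SMILES (0.0 if empty)."""
    if not batch:
        return 0.0
    rewards = [
        biodegradability_reward(s, training_smiles, predict_probability)
        for s in batch
    ]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    # Stub predictor with made-up probabilities, for demonstration only.
    stub = {"CCO": 0.9, "CC(=O)O": 0.8, "c1ccccc1Cl": 0.1}
    training = {"CCO"}
    print(mean_batch_reward(list(stub), training, lambda s: stub[s]))
```

In a policy-gradient setup, a batch-level average like this is typically what would be tracked during training, while the per-molecule reward would weight the generator update.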
Electrolytes are fundamental components in numerous applications, ranging from energy storage systems such as batteries and supercapacitors to vital functions in biological systems. We focused on developing an electrolyte with specific properties such as low viscosity, high conductivity, and cost-effectiveness. To achieve this, we explored the molecular weights of the electrolytes (leveraging large PubChem data [47–49]), which correlate directly with viscosity, a primary determinant of electrolyte performance. Electrolytes are particularly instrumental in regulating ion transport within various electrochemical devices, making their viscosity a key factor in overall efficiency. For the prediction model, we utilized a normalized molecular weight distribution with a mean of 60 g/mol and a deviation of 0.7. This distribution allows us to assess the molecular weight of each generated SMILES. If a molecule is either too light or too heavy, the probability assigned to it under the assumed distribution is very low. We incorporated this probability as a penalty and reward mechanism within the RL framework (Fig. 7a). By incorporating these methodologies, we could develop a novel electrolyte that meets the desired criteria of low viscosity, high conductivity, and cost-effectiveness, contributing to the advancement of organic materials and their applications. As shown in Fig. 7b, the desired lithium-based electrolytes with the expected molecular weight were generated using the model presented in Fig. 7a. The distribution in Fig. 7c clearly indicates that after incorporating the RL, we could achieve a much wider molecular weight distribution. Compounds with a specific or desired molecular weight could thus be generated by feeding such information to the RL framework, drastically improving design-of-experiments approaches and achieving facile material discovery.

4. Conclusion
This study presents a comprehensive analysis of our proposed integrated model for biodegradability prediction and novel material discovery. The model's predictive capabilities were validated, demonstrating superior performance compared to the CNN-only and LSTM-only models. The integrated model achieved 87.2% AUC, showing its ability to learn spatial and temporal dependencies in SMILES data. Our novel material discovery approach, which uses a GRU-based SMILES generator within a reinforcement learning framework, showed significant potential. Around 60% of the generated SMILES were valid, and over 80% were distinct from the training dataset, indicating their novelty. Moreover, through RL, we enhanced the model's ability to generate biodegradable materials, with approximately 95% being biodegradable, including 42% not present in the original training dataset. Furthermore, we compared our model to a state-of-the-art model proposed for de novo drug design and achieved superior results in terms of ROC, highlighting the model's potential in diverse applications. Extending our research to the design of novel electrolytes using large-scale molecular data, we developed a novel electrolyte with specific properties such as low viscosity, high conductivity, and cost-effectiveness, contributing to the advancement of organic materials and their applications. Our integrated model has shown strong promise in biodegradability prediction, material discovery, and electrolyte design. Future work could further enhance the model's capabilities and explore its applications in various material discovery fields. This research represents a step towards leveraging artificial intelligence for material discovery and design.

Declarations

Acknowledgment
This research was supported by INHA UNIVERSITY Research

References
[1] F. Wu, M. Misra, A.K. Mohanty, Challenges and new opportunities on barrier performance of biodegradable polymers for sustainable packaging, Prog Polym Sci. 117 (2021) 101395. https://doi.org/10.1016/j.progpolymsci.2021.101395.
[2] R. Grace, Closing the Circle: Reshaping How Products are Conceived & Made, Plastics Engineering. 73 (2017) 8–11. https://doi.org/10.1002/j.1941-9635.2017.tb01670.x.
[3] F. Allen, J. Gasparro, J. Swaney, M. Phelan, J. Gillespie, Directive 2004/38/EC of the European Parliament and of the Council of 29 April 2004, Immigration Law Handbook. (2023) 2253-C79P212. https://doi.org/10.1093/oso/9780192896292.003.0079.
[4] Test No. 301: Ready Biodegradability, OECD, 1992. https://doi.org/10.1787/9789264070349-en.
[5] Identification of biodegradation models under model and data uncertainty, Water Science and Technology. 33 (1996). https://doi.org/10.1016/0273-1223(96)00192-8.
[6] P.G. Polishchuk, T.I. Madzhidov, A. Varnek, Estimation of the size of drug-like chemical space based on GDB-17 data, J Comput Aided Mol Des. 27 (2013) 675–679. https://doi.org/10.1007/s10822-013-9672-4.
[7] D. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J Chem Inf Comput Sci. 28 (1988) 31–36. https://doi.org/10.1021/ci00057a005.
[8] C. Bilodeau, W. Jin, T. Jaakkola, R. Barzilay, K.F. Jensen, Generative models for molecular discovery: Recent advances and challenges, WIREs Computational Molecular Science. 12 (2022). https://doi.org/10.1002/wcms.1608.
[9] M. Olivecrona, T. Blaschke, O. Engkvist, H. Chen, Molecular de-novo design through deep reinforcement learning, J Cheminform. 9 (2017) 48. https://doi.org/10.1186/s13321-017-0235-x.
[10] P.-H. Chiu, Y.-L. Yang, H.-K. Tsao, Y.-J. Sheng, Deep learning for predictions of hydrolysis rates and conditional molecular design of esters, J Taiwan Inst Chem Eng. 126 (2021) 1–13. https://doi.org/10.1016/j.jtice.2021.06.045.
[11] M. Wang, C.-Y. Hsieh, J. Wang, D. Wang, G. Weng, C. Shen, X. Yao, Z. Bing, H. Li, D. Cao, T. Hou, RELATION: A Deep Generative Model for Structure-Based De Novo Drug Design, J Med Chem. 65 (2022) 9478–9492. https://doi.org/10.1021/acs.jmedchem.2c00732.
[12] J. Arús-Pous, A. Patronov, E.J. Bjerrum, C. Tyrchan, J.-L. Reymond, H. Chen, O. Engkvist, SMILES-based deep generative scaffold decorator for de-novo drug design, J Cheminform. 12 (2020) 38. https://doi.org/10.1186/s13321-020-00441-8.
[13] N. De Cao, T. Kipf, MolGAN: An implicit generative model for small molecular graphs, ArXiv. abs/1805.1 (2018). https://www.semanticscholar.org/paper/def1049b5aae96c8e1eab0ca58d77ac9c2f0e3e9.
[14] W. Tang, Y. Li, Y. Yu, Z. Wang, T. Xu, J. Chen, J. Lin, X. Li, Development of models predicting biodegradation rate rating with multiple linear regression and support vector machine algorithms, Chemosphere. 253 (2020) 126666. https://doi.org/10.1016/j.chemosphere.2020.126666.
[15] O. Dollar, N. Joshi, D.A.C. Beck, J. Pfaendtner, Attention-based generative models for de novo molecular design, Chem Sci. 12 (2021) 8362–8372. https://doi.org/10.1039/d1sc01050f.
[16] F. Lunghini, G. Marcou, P. Gantzer, P. Azam, D. Horvath, E. Van Miert, A. Varnek, Modelling of ready biodegradability based on combined public and industrial data sources, SAR QSAR Environ Res. 31 (2019) 171–186. https://doi.org/10.1080/1062936x.2019.1697360.
[17] W.F.C. Rocha, D.A. Sheen, Classification of biodegradable materials using QSAR modelling with uncertainty estimation, SAR QSAR Environ Res. 27 (2016) 799–811. https://doi.org/10.1080/1062936X.2016.1238010.
[18] K. Acharya, D. Werner, J. Dolfing, M. Barycki, P. Meynet, W. Mrozik, O. Komolafe, T. Puzyn, R.J. Davenport, A quantitative structure-biodegradation relationship (QSBR) approach to predict biodegradation rates of aromatic chemicals, Water Res. 157 (2019) 181–190. https://doi.org/10.1016/j.watres.2019.03.086.
[19] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, V. Consonni, QSAR biodegradation, (2013).
[20] P. Dey, S.K. Chaulya, S. Kumar, Hybrid CNN-LSTM and IoT-based coal mine hazards monitoring and prediction system, Process Safety and Environmental Protection. 152 (2021) 249–263. https://doi.org/10.1016/J.PSEP.2021.06.005.
[21] Y. Zhao, Improvement and Application of Multi-layer LSTM Algorithm Based on Spatial-Temporal Correlation, Ingénierie Des Systèmes d Inf. 25 (2020). https://doi.org/10.18280/isi.250107.
[22] C. Ding, G. Wang, X. Zhang, Q. Liu, X. Liu, A hybrid CNN-LSTM model for predicting PM2.5 in Beijing based on spatiotemporal correlation, Environ Ecol Stat. 28 (2021) 503–522. https://doi.org/10.1007/s10651-021-00501-8.
[23] D.Q. Gbadago, J. Moon, M. Kim, S. Hwang, A unified framework for the mathematical modelling, predictive analysis, and optimization of reaction systems using computational fluid dynamics, deep neural network and genetic algorithm: A case of butadiene synthesis, Chemical Engineering Journal. 409 (2021) 128163. https://doi.org/10.1016/j.cej.2020.128163.
[24] J. Moon, D.Q. Gbadago, G. Hwang, D. Lee, S. Hwang, Software platform for high-fidelity-data-based artificial neural network modeling and process optimization in chemical engineering, Comput Chem Eng. 158 (2022) 107637. https://doi.org/10.1016/J.COMPCHEMENG.2021.107637.
[25] P. Dey, K. Saurabh, C. Kumar, D. Pandit, S.K. Chaulya, S. Ray, G.M. Prasad, S.K. Mandal, t-SNE and variational auto-encoder with a bi-LSTM neural network-based model for prediction of gas concentration in a sealed-off area of underground coal mines, Soft Comput. 25 (2021) 14183–14207. https://doi.org/10.1007/s00500-021-06261-8.
[26] W. Wang, A Pre-trained Conditional Transformer for Target-specific De Novo Molecular Generation, (2022). https://www.semanticscholar.org/paper/ed9763062daec0eec7ceb65e822360e340c75605.
[27] X. Yang, Z. Zhang, An attention-based domain spatial-temporal meta-learning (ADST-ML) approach for PM2.5 concentration dynamics prediction, Urban Clim. (2023). https://doi.org/10.1016/j.uclim.2022.101363.
[28] N. Xu, X. Wang, X. Meng, H. Chang, Gas Concentration Prediction Based on IWOA-LSTM-CEEMDAN Residual Correction Model, Sensors (Basel). 22 (2022). https://doi.org/10.3390/s22124412.
[29] L. Pingyang, N. Chen, M. Shanjun, L. Mei, LSTM based encoder-decoder for short-term predictions of gas concentration using multi-sensor fusion, Process Safety and Environmental Protection. 137 (2020) 93–105. https://doi.org/10.1016/j.psep.2020.02.021.
[30] K. Kumari, P. Dey, C. Kumar, D. Pandit, S. Mishra, V. Kisku, S.K. Chaulya, S. Ray, G.M. Prasad, UMAP and LSTM based fire status and explosibility prediction for sealed-off area in underground coal mine, Process Safety and Environmental Protection. 146 (2021) 837–852. https://doi.org/10.1016/j.psep.2020.12.019.
[31] M. Popova, O. Isayev, A. Tropsha, Deep reinforcement learning for de novo drug design, Sci Adv. 4 (2018) eaap7885. https://doi.org/10.1126/sciadv.aap7885.
[32] M. Popova, M. Shvets, J.B. Oliva, O. Isayev, MolecularRNN: Generating realistic molecular graphs with optimized properties, ArXiv. abs/1905.1 (2019). https://www.semanticscholar.org/paper/3ccd291c8848c73ca34152e27c3ec296cfc838d0.
[33] Z. Zhou, S. Kearnes, L. Li, R. Zare, P.F. Riley, Optimization of Molecules via Deep Reinforcement Learning, Sci Rep. 9 (2018). https://doi.org/10.1038/s41598-019-47148-x.
[34] Bellman-consistent Pessimism for Offline Reinforcement Learning, OpenReview, (n.d.). https://openreview.net/forum?id=e8WWUBeafM (accessed October 10, 2023).
[35] B. O'Donoghue, I. Osband, R. Munos, V. Mnih, The Uncertainty Bellman Equation and Exploration, (2018).
[36] Y. Fei, Z. Yang, Y. Chen, Z. Wang, Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning, (n.d.).
[37] H.A. Fayed, A.F. Atiya, Speed up grid-search for parameter selection of support vector machines, Appl Soft Comput. 80 (2019) 202–210. https://doi.org/10.1016/J.ASOC.2019.03.037.
[38] S.M. LaValle, M.S. Branicky, S.R. Lindemann, On the Relationship between Classical Grid Search and Probabilistic Roadmaps, Int J Rob Res. 23 (2004) 673–692. https://doi.org/10.1177/0278364904045481.
[39] P. Liashchynskyi, P. Liashchynskyi, Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS, (2019). https://arxiv.org/abs/1912.06059v1 (accessed October 11, 2023).
[40] F.J. Pontes, G.F. Amorim, P.P. Balestrassi, A.P. Paiva, J.R. Ferreira, Design of experiments and focused grid search for neural network parameter optimization, Neurocomputing. 186 (2016) 22–34. https://doi.org/10.1016/J.NEUCOM.2015.12.061.
[41] R.Y. Acharya, N.F. Charlot, M.M. Alam, F. Ganji, D. Gauthier, D. Forte, Chaogate parameter optimization using bayesian optimization and genetic algorithm, Proceedings - International Symposium on Quality Electronic Design, ISQED. 2021-April (2021) 426–431. https://doi.org/10.1109/ISQED51717.2021.9424355.
[42] H. Alibrahim, S.A. Ludwig, Hyperparameter Optimization: Comparing Genetic Algorithm against Grid Search and Bayesian Optimization, IEEE Congress on Evolutionary Computation (CEC). (2021) 1551–1559. https://doi.org/10.1109/cec45853.2021.9504761.
[43] Y. Shin, Z. Kim, J. Yu, G. Kim, S. Hwang, Development of NOx reduction system utilizing artificial neural network (ANN) and genetic algorithm (GA), J Clean Prod. 232 (2019) 1418–1429. https://doi.org/10.1016/j.jclepro.2019.05.276.
[44] D.Q. Gbadago, J. Moon, M. Kim, S. Hwang, A unified framework for the mathematical modelling, predictive analysis, and optimization of reaction systems using computational fluid dynamics, deep neural network and genetic algorithm: A case of butadiene synthesis, Chemical Engineering Journal. 409 (2021) 128163. https://doi.org/10.1016/j.cej.2020.128163.
[45] F. Mohammadi, M.R. Samaei, A. Azhdarpoor, H. Teiri, A. Badeenezhad, S. Rostami, Modelling and Optimizing Pyrene Removal from the Soil by Phytoremediation using Response Surface Methodology, Artificial Neural Networks, and Genetic Algorithm, Chemosphere. 237 (2019) 124486. https://doi.org/10.1016/j.chemosphere.2019.124486.
[46] B. Athiwaratkun, J.W. Stokes, Malware classification with LSTM and GRU language models and a character-level CNN, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. (2017) 2482–2486. https://doi.org/10.1109/ICASSP.2017.7952603.
[47] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B.A. Shoemaker, P.A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E.E. Bolton, PubChem 2023 update, Nucleic Acids Res. 51 (2023) D1373–D1380. https://doi.org/10.1093/NAR/GKAC956.
[48] V.D. Hähnke, S. Kim, E.E. Bolton, PubChem chemical structure standardization, J Cheminform. 10 (2018). https://doi.org/10.1186/S13321-018-0293-8.
[49] S. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B.A. Shoemaker, P.A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E.E. Bolton, PubChem 2019 update: improved access to chemical data, Nucleic Acids Res. 47 (2019) D1102–D1109. https://doi.org/10.1093/NAR/GKY1033.

Version 1 posted
Editorial decision: Major Revisions Needed, 07 Apr, 2024
Reviewers agreed at journal, 15 Mar, 2024
Reviewers invited by journal, 13 Mar, 2024
Editor assigned by journal, 04 Mar, 2024
First submitted to journal, 28 Feb, 2024
We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-4002218","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":279382306,"identity":"27b2641c-d993-40e3-b906-9eb2cea4745b","order_by":0,"name":"Dela Quarme Gbadago","email":"","orcid":"","institution":"Inha University","correspondingAuthor":false,"prefix":"","firstName":"Dela","middleName":"Quarme","lastName":"Gbadago","suffix":""},{"id":279382307,"identity":"cd40f0dd-83d9-499c-ad3e-7be9cef2e49c","order_by":1,"name":"Gyuyeong Hwang","email":"","orcid":"","institution":"Inha University","correspondingAuthor":false,"prefix":"","firstName":"Gyuyeong","middleName":"","lastName":"Hwang","suffix":""},{"id":279382308,"identity":"aa9ddc3c-e9e9-4d47-8cb2-b89dc5f452f2","order_by":2,"name":"Kihwan Lee","email":"","orcid":"","institution":"Inha University","correspondingAuthor":false,"prefix":"","firstName":"Kihwan","middleName":"","lastName":"Lee","suffix":""},{"id":279382309,"identity":"b9b99c4d-f5ed-48dd-b698-2cd7684abdd1","order_by":3,"name":"Sungwon 
Hwang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA/UlEQVRIiWNgGAWjYDADAzBZIQFlMCQQq+UMyVoY2xgIa5GfkXv4NU/FHQZzieRnD7/Os8gzZz/A+OEHQ1o+TsNv5KVZ85x5xmA5I83cWHabRLFlTwKzZA9DjmUDLi0SOWbGuW2HGQxuJ5hJS26TSNxwIIFBGhgQBrgdBteS/k1acg5Qy/kHzL/xaWG4kWP8GKIlx0zyYwNQy40ENqAtOTi1GJx5Y8b858xhHoP7b8qkGY6BtDxss+wxSMPtsPYc448zKg7LGZw5vk3yR00d0GHJh2/8qEjG7TAGBjYJIMEDYjGDSQbGBnjs4ADMH2Asxh94FY6CUTAKRsFIBQDMvFbI3giZqwAAAABJRU5ErkJggg==","orcid":"https://orcid.org/0000-0003-0744-3931","institution":"Inha University","correspondingAuthor":true,"prefix":"","firstName":"Sungwon","middleName":"","lastName":"Hwang","suffix":""}],"badges":[],"createdAt":"2024-03-01 05:55:08","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4002218/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4002218/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s11814-024-00202-5","type":"published","date":"2024-06-12T15:12:10+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":52923514,"identity":"fd9db788-bae4-4942-b1b7-ec10e6d13797","added_by":"auto","created_at":"2024-03-18 17:47:32","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":78436,"visible":true,"origin":"","legend":"\u003cp\u003e(a) Representation of SMILES Tokenization, (b) different SMILES representations of 3-Ethylpheonl, (c) samples of compounds used during the model training, and (d) distribution of materials in the dataset.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/84fc7daadb7612282819a25b.png"},{"id":52924760,"identity":"2483ea55-0b4a-4159-89bc-f57bd3bb5f17","added_by":"auto","created_at":"2024-03-18 17:55:32","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":68798,"visible":true,"origin":"","legend":"\u003cp\u003eGraphical 
representation of the proposed novel AI model for material discovery and prediction\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/2fed320ae410cb9be58af202.png"},{"id":52923516,"identity":"7ae039e8-95e6-487e-bfdb-8ff142e89044","added_by":"auto","created_at":"2024-03-18 17:47:32","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":147109,"visible":true,"origin":"","legend":"\u003cp\u003eComparison plots between (a) the CNN-only model, (b) the LSTM-only model, and (c) the LSTM-CNN integrated model results.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/e2c276e3bf3f5ff540e833ed.png"},{"id":52923517,"identity":"129fae3c-9fce-4fb0-990a-664123a87a46","added_by":"auto","created_at":"2024-03-18 17:47:32","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":118883,"visible":true,"origin":"","legend":"\u003cp\u003eGenerated novel biodegradable materials from the GRU\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/de013f5a78ca8bb95ad4566c.png"},{"id":52923519,"identity":"6035cb2f-b31c-4fdb-bccd-234f3699e0ce","added_by":"auto","created_at":"2024-03-18 17:47:32","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":41311,"visible":true,"origin":"","legend":"\u003cp\u003eComparison between (a) GRU-only generative model results and (b) GRU-RL generative model results. 
(c) Newly discovered materials with diverse functional groups.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/8ed0e5c4440e01054ee70ced.png"},{"id":52924761,"identity":"49923622-ab89-4006-8c65-9c95d4608e4d","added_by":"auto","created_at":"2024-03-18 17:55:32","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":130534,"visible":true,"origin":"","legend":"\u003cp\u003eComparison of (a) state-of-the-art model [31] results with our (b) our proposed model.\u003c/p\u003e","description":"","filename":"6.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/91398a7f7059444947ca2a98.png"},{"id":52923520,"identity":"ba34fdb2-9fde-488e-bb26-ea539b86a583","added_by":"auto","created_at":"2024-03-18 17:47:32","extension":"png","order_by":7,"title":"Figure 7","display":"","copyAsset":false,"role":"figure","size":96006,"visible":true,"origin":"","legend":"\u003cp\u003e(a) Electrolyte generation scheme, (b) generated electrolytes, and their (c) distribution.\u003c/p\u003e","description":"","filename":"7.png","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/f79dfd8cdcfbb7a3bda0f417.png"},{"id":58822931,"identity":"77c9d019-1d0c-45c1-ab05-6f4640aab754","added_by":"auto","created_at":"2024-06-21 16:49:33","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1110168,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4002218/v1/35a1c9a9-3ec5-4c6d-a9f9-6e640a453da2.pdf"}],"financialInterests":"","formattedTitle":"Deep Learning for Green Chemistry: An AI-Enabled Pathway for Biodegradability Prediction and Organic Material Discovery","fulltext":[{"header":"1. 
Introduction","content":"\u003cp\u003eThroughout scientific development, humanity has produced an abundance of organic compounds, many of which are utilized once and then discarded. The yearly production of plastic has reached an astonishing 450\u0026nbsp;million tons, with 340\u0026nbsp;million tons being generated as waste [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e]. Regrettably, these organic compounds exhibit remarkable resistance to natural decomposition, leading to their persistence in the environment and posing significant threats to human well-being and ecosystems[\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e]. Consequently, assessing the biodegradability of organic compounds has been increasingly regarded as crucial in recent times. Following the European Registration, Evaluation, Authorization, and Restriction of Chemicals (REACH) regulation, companies engaged in the manufacturing or importing of chemicals exceeding 1 ton per year are mandated to provide detailed information regarding the biodegradability of their compounds[\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e]. To evaluate biodegradability, standardized test methods published by prestigious organizations such as the Organization for Economic Co-operation and Development (OECD)[\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] and Japan's Ministry of International Trade and Industry (MITI)[\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e] are primarily employed. In addition to assessing the biodegradability of existing compounds, the significance of discovering novel biodegradable organic compounds is also growing. However, searching for potential candidates within the entire compound space is nearly impossible due to its vast scale, estimated to range from 10\u003csup\u003e23\u003c/sup\u003e to 10\u003csup\u003e60\u003c/sup\u003e. 
Predicting new molecules through calculations, synthesizing them, and testing their physical properties is time-consuming. As a result, only approximately 10\u003csup\u003e8\u003c/sup\u003e compounds have been synthesized thus far[\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eUtilizing generative models for discovering new molecules alleviates these challenges. Unlike conventional methods, generative models operate through inverse modeling. This means that new molecules are generated based on desired properties, offering a more efficient approach to exploration. Different methods have been devised to enable the incorporation of complex molecular structures into neural networks. One prevalent approach is using Simplified Molecular Input Line Entry System (SMILES)[\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e], which converts molecules into a one-dimensional text array following a specific set of rules. Due to its effectiveness, SMILES is widely employed in many molecular generation models. Recently, the use of generative models for chemical substance discovery has been actively researched[\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e]. Early generative models were developed by combining recurrent neural networks (RNNs) and reinforcement learning[\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e]. However, to overcome the limitations of these models, various types of generative models have been developed. Chiu et al.[\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e] proposed a method for predicting the hydrolysis rate by utilizing not only the SMILES representation but also the partial charge of the molecule as inputs to the autoencoder. 
Wang et al.[\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e] addressed the challenge of balancing desirable properties and novelty in molecular design. They developed a model that interprets the ligand-receptor structure by taking the molecular 3D structure as an input. J Ar\u0026uacute;s-Pous et al.[\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e] divided the existing dataset into subsets with desired molecular scaffolds to devise a strategy to create molecules with specific characteristics without using reinforcement learning. Cao et al.[\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e] conducted research on avoiding the computationally expensive likelihood-matching process. They used generative adversarial networks (GANs) with graphs as inputs. Tang et al.[\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e] employed a Support Vector Machine (SVM) classifier to enhance the prediction accuracy and overcome the limitations of linear regression when predicting the biodegradability of large molecules. Dollar et al.[\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e] attempted to introduce the attention mechanism, commonly used in translation tasks, into variational autoencoders (VAE) for de novo molecular design. While several studies have been conducted in this area, there is a notable lack of research on generative models for discovering biodegradable organic compounds. The main challenge lies in training a model due to the severe insufficiency of the biodegradability database. In contrast to the readily available abundance of information, such as LogP, which can be easily accessed through methods like RDkit, the resources for biodegradability data remain scarce. 
As a response to this issue, a study was carried out by Lunghini et al.[\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e] to construct a substantial database by integrating various biodegradability data. Additionally, given the complex mechanisms determining the biodegradation rate, numerous models employing the Quantitative Structure-Activity Relationship (QSAR) method are being explored to classify compounds into biodegradable and non-biodegradable substances[\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e, \u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e]. However, these models are imperfect, mainly due to their limited applicability scope.\u003c/p\u003e \u003cp\u003eFurthermore, like the previous examples, much research has focused on enhancing prediction performance by altering the generative model. However, a limited body of research is dedicated to improving the prediction model. Particularly in the case of biodegradability, accessing sufficient databases for training remains challenging, and a well-defined mathematical and quantitative method for determining the biodegradability of newly synthesized molecules has yet to be established. Given these constraints, a viable approach for biodegradability prediction involves enabling the neural network to learn molecular features. Therefore, in this study, we introduce an integrated methodology that significantly advances the field of biodegradability prediction and material discovery. This innovative approach combines deep learning techniques, generative models, and reinforcement learning to address the complex task of efficiently identifying novel biodegradable organic compounds. Our research establishes a robust data preparation pipeline, utilizing SMILES notations for versatile compound representation and employing data augmentation techniques to enhance dataset diversity. 
The proposed prediction model adopts a hybrid architecture, leveraging long short-term memory (LSTM) networks and convolutional neural networks (CNNs), effectively handling sequential data and spatial patterns to provide highly accurate biodegradability predictions. By adopting a stack augmented RNN for molecular trajectory generation within a reinforcement learning framework, our generator model empowers the exploration of intricate chemical spaces, facilitating the discovery of environmentally friendly materials. Furthermore, our research incorporates a reward mechanism that quantifies the value of molecular structures based on biodegradability, thus ensuring the alignment of the learning process with environmentally conscious objectives. We also employ a systematic grid search for hyperparameter optimization, guaranteeing that model configurations are finely tuned for optimal predictive accuracy.\u003c/p\u003e \u003cp\u003eThe rest of the study is structured as follows. Section 2 describes the algorithms and procedures implemented in this work. Section 3 presents the simulation results, comparative analysis, and discussion of findings. The study is concluded in Section 4, wherein an overview of the contributions of this study and its applications are presented.\u003c/p\u003e"},{"header":"2. Methodology","content":"\u003cp\u003eThis section comprehensively describes the solution strategies and algorithms adopted in executing the study. The data processing methods, prediction models, optimization steps, and generator models are discussed.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1 Data preparation and processing\u003c/h2\u003e \u003cp\u003eThe rapid advancement of computing has opened new avenues for predicting and exploring the biodegradability of organic compounds. Existing methods often require laborious and computationally expensive DFT calculations, hindering their scalability and efficiency. 
This research aims to develop an AI-driven model that leverages molecular formulas and structures for efficient biodegradability prediction. To represent a large number of compounds, we employ the Simplified Molecular Input Line Entry System (SMILES), a line notation that is easy for computers to process. This notation allows us to effectively encode and process the chemical structures of compounds in the AI models. The SMILES notation allows flexibility in representing molecules by specifying the connectivity of atoms through their bonds. Different starting atoms or traversal orders result in distinct SMILES strings, enabling multiple valid representations for the same compound. The SMILES strings are also converted into structural images for subsequent training. Examples of compounds, their respective SMILES notations, and structural images are depicted in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. A dataset of 1055 organic compounds, comprising 355 readily biodegradable (RB) species and 700 non-readily biodegradable species, was obtained from Kamel et al.[\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e], where a detailed description of the data is available. The dataset was shuffled to ensure a random distribution, and subsequently divided into specific segments for training and validation. The SMILES strings were converted into canonical forms, ensuring a standardized representation of each chemical compound. Additionally, random permutations of atomic indices were generated to augment the dataset, providing diverse representations of the same chemical structures. A tokenization procedure was applied to the SMILES strings to separate them into individual atomic symbols and other special characters. The set of unique tokens obtained was mapped to corresponding indices, creating a consistent format for subsequent training. 
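The tokenization and index-mapping steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' tokenizer: the regular expression, the function names (`tokenize`, `build_vocab`, `encode`), the three example SMILES strings, and the choice of index 0 for padding are all hypothetical.

```python
import re

# Hypothetical token pattern (an assumption, not the authors' tokenizer):
# bracket atoms first, then two-letter halogens, then any single character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|.")


def tokenize(smiles):
    """Split a SMILES string into atomic symbols and special characters."""
    return TOKEN_PATTERN.findall(smiles)


def build_vocab(smiles_list):
    """Map every unique token to an integer index; 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: idx for idx, tok in enumerate(tokens, start=1)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string into a fixed-length list of token indices."""
    ids = [vocab[tok] for tok in tokenize(smiles)]
    return ids + [0] * (max_len - len(ids))


# Illustrative inputs only; these are not taken from the paper's dataset.
smiles_list = ["CCO", "ClCCBr", "C[N+](C)(C)C"]
vocab = build_vocab(smiles_list)
max_len = max(len(tokenize(s)) for s in smiles_list)
encoded = [encode(s, vocab, max_len) for s in smiles_list]
```

Padding every sequence to the longest tokenized length keeps the encoded rows rectangular, so they can be stacked into a single tensor by whichever deep learning framework is used.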
The length of the tokenized SMILES strings in the dataset was evaluated, and the maximum length was determined, allowing for the consistent handling of SMILES strings of varying lengths. The dataset was further processed to generate input-output pairs suitable for training LSTM networks, involving randomizing the SMILES strings and converting them into a tensor format. A conversion process was implemented to transform characters or strings into corresponding tensor formats. This facilitated the handling of data within the deep learning framework.\u003c/p\u003e \u003cp\u003eTraining the model with different SMILES representations and images of the same compound at each iteration can enhance the model\u0026rsquo;s generalizability as the dataset increases. This approach allows the model to learn diverse representations of the same compound, capturing various aspects of its chemical structure and visual characteristics. The training process benefits from the increased variability in the data, enabling the model to better generalize and make accurate predictions on unseen compounds. This technique promotes robustness and adaptability in the model's learning process, ultimately improving its performance in biodegradability prediction and material discovery.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2 Prediction Model Building\u003c/h2\u003e \u003cp\u003eIn this study, a hybrid approach leveraging two distinct deep learning architectures, namely LSTM networks and CNN, was developed to tackle the predictive task encompassing the analysis of chemical structures. LSTM networks are efficient at processing time series and textual data, which are essential in extracting features in organic compounds. They excel in recognizing long-term dependencies and patterns within sequential data, such as chemical structures and physical properties, which are crucial for predicting biodegradability. 
LSTM's ability to retain and utilize historical information allows for accurate biodegradability predictions by learning from molecular descriptors and their effects over time. CNNs are effective in biodegradability prediction by extracting features from image data of chemical structures. Training on these structures and their biodegradability labels, CNNs identify local patterns and spatial relationships key to assessing biodegradation potential. Convolutional layers use filters to capture significant features at different scales, enabling CNNs to forecast the biodegradability of previously unseen compounds with enhanced precision.\u003c/p\u003e \u003cp\u003eThe combined architecture synthesizes the inherent strengths of both LSTM and CNN models, facilitating the interpretation of complex patterns within data represented through both sequences and images. The LSTM component, constructed as a two-layer model accepting inputs of dimension 165, was employed for its ability to handle sequential data, reflecting the sequential nature of chemical information in SMILES strings. 
An embedding layer was incorporated with an output dimension of 12, effectively reducing dimensionality and capturing semantic relationships, represented by:\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$$e(x)={W_e} \\cdot x+{b_e}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cspan type=\"ItalicUnderline\" class=\"ItalicUnderline\" name=\"Emphasis\"\u003ex\u003c/span\u003e represents the input, \u003cem\u003eW\u003c/em\u003e\u003csub\u003e\u003cem\u003ee\u003c/em\u003e\u003c/sub\u003e represents the embedding matrix, and \u003cem\u003eb\u003c/em\u003e\u003csub\u003e\u003cem\u003ee\u003c/em\u003e\u003c/sub\u003e represents the bias.\u003c/p\u003e \u003cp\u003eThe LSTM layer, consisting of 256 units, provides the network's memory function, capturing long-term dependencies and patterns over time, making it highly relevant for analyzing the chemical structure of organic compounds and their biodegradability. 
The layer can be mathematically represented as [\u003cspan additionalcitationids=\"CR21\" citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]:\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$\\begin{array}{*{20}{l}} {{f_t}=\\sigma ({W_f} \\cdot \\left[ {{h_{t - 1}},{x_t}} \\right]+{b_f})} \\\\ {{i_t}=\\sigma ({W_i} \\cdot \\left[ {{h_{t - 1}},{x_t}} \\right]+{b_i})} \\\\ {{o_t}=\\sigma ({W_o} \\cdot \\left[ {{h_{t - 1}},{x_t}} \\right]+{b_o})} \\\\ {{c_t}={f_t} \\odot {c_{t - 1}}+{i_t} \\odot tanh({W_c} \\cdot \\left[ {{h_{t - 1}},{x_t}} \\right]+{b_c})} \\\\ {{h_t}={o_t} \\odot tanh\\left( {{c_t}} \\right)} \\end{array}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003ef\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003ei\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e, and \u003cem\u003eo\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e are the forget, input, and output gates, \u003cem\u003ec\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e is the cell state, \u003cem\u003eh\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e is the hidden state, \u003cem\u003eσ\u003c/em\u003e is the sigmoid activation function, and ⊙ represents elementwise multiplication.\u003c/p\u003e \u003cp\u003eSubsequent layers included a dropout layer with a rate of 0.3, to prevent overfitting, and a dense layer with 35 units employing a hyperbolic tangent activation function and He normal initialization, enhancing the network's ability to capture non-linear relationships, all of which were obtained by grid search optimization as described in the subsequent 
section. The output layer is defined as:\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$$y=tanh({W_d} \\cdot h+{b_d})$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003ey\u003c/em\u003e is the output vector or tensor, \u003cem\u003eW\u003c/em\u003e\u003csub\u003e\u003cem\u003ed\u003c/em\u003e\u003c/sub\u003e is the weight matrix connecting the previous layer's outputs \u003cem\u003eh\u003c/em\u003e to the current layer's inputs, \u003cem\u003eh\u003c/em\u003e is the input vector or tensor from the previous layer, and \u003cem\u003eb\u003c/em\u003e\u003csub\u003e\u003cem\u003ed\u003c/em\u003e\u003c/sub\u003e is the bias vector added to the weighted sum before applying the activation function.\u003c/p\u003e \u003cp\u003eConversely, the CNN model was adopted for its effectiveness in analyzing spatial patterns within images, pertinent to the 300\u0026times;300 images with three channels used in this study. 
The model begins with a Conv2D layer composed of 6 filters of size 3\u0026times;3 and strides of 4\u0026times;4, represented as:\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$${Y_{ij}}={\\sum _{m,n}}{X_{i+m,j+n}} \\cdot {K_{mn}}+b$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eY\u003c/em\u003e\u003csub\u003e\u003cem\u003eij\u003c/em\u003e\u003c/sub\u003e is the output feature map at position (\u003cem\u003ei,j\u003c/em\u003e), \u003cem\u003eX\u003c/em\u003e\u003csub\u003e\u003cem\u003ei+m,j+n\u003c/em\u003e\u003c/sub\u003e are input values at relative positions, \u003cem\u003eK\u003c/em\u003e\u003csub\u003e\u003cem\u003emn\u003c/em\u003e\u003c/sub\u003e are convolutional filter weights, and \u003cem\u003eb\u003c/em\u003e is the bias term.\u003c/p\u003e \u003cp\u003eThe convolution is followed by batch normalization and ReLU activation to accelerate training and introduce non-linearity:\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$\\begin{gathered} {Y_{normalized}}=\\frac{{Y - \\mu }}{{\\sqrt {{\\sigma ^2}+\\epsilon } }} \\hfill \\\\ {Y_{ReLU}}=max(0,{Y_{normalized}}) \\hfill \\\\ \\end{gathered}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eSubsequent max-pooling layers reduced dimensionality and emphasized salient features, while the sequence concluded with a flattening step, a dropout layer with a rate of 0.3, and a dense layer of 50 neurons with ReLU activation [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e] and He normal initialization, further contributing to robust feature extraction. 
These hyperparameters were also obtained via the grid search optimization.\u003c/p\u003e \u003cp\u003eThe outputs from the LSTM and CNN models were concatenated, capitalizing on their synergistic strengths, followed by two dense layers with 40 and 2 units, respectively. The latter employed a SoftMax activation function[\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e], enabling probabilistic interpretation of the model's predictions:\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$y=SoftMax(W \\cdot [{h_{LSTM}},{h_{CNN}}]+b)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe combined model was compiled with the Adam optimizer[\u003cspan additionalcitationids=\"CR27\" citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e] at a learning rate of 0.0001 and categorical cross-entropy loss function[\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e, \u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e], optimizing for multi-class classification performance:\u003c/p\u003e \u003cp\u003e \u003cspan class=\"InlineEquation\"\u003e \u003c/span\u003e \u003cspan class=\"InlineEquation\"\u003e \u003cspan class=\"mathinline\"\u003e\$L= - \\frac{1}{N}\\sum\\nolimits_{{i=1}}^{N} {{y_i}log\\left( {\\widehat {{{y_i}}}} \\right)}\$\u003c/span\u003e \u003c/span\u003e (7)\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e denotes the true label, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\widehat{y}}_{i}\$\u003c/span\u003e\u003c/span\u003e denotes the predicted label for each sample in the batch of size N.\u003c/p\u003e 
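The SoftMax output of Eq. (6) and the categorical cross-entropy loss of Eq. (7) can be illustrated with a short worked example. This is a sketch under stated assumptions rather than the authors' implementation: the logits, the one-hot labels, the class ordering, and the small constant `eps` that guards against log(0) are all hypothetical.

```python
import math


def softmax(logits):
    """Turn a list of raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the maximum improves numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean over the batch of -sum_c y_ic * log(yhat_ic)."""
    loss = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        loss -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return loss / len(y_true)


# Hypothetical two-class logits for a batch of two compounds.
logits = [[2.0, 0.5], [0.1, 1.2]]
probs = [softmax(z) for z in logits]

# One-hot labels; which column stands for "biodegradable" is an assumption.
labels = [[1, 0], [0, 1]]
loss = categorical_cross_entropy(labels, probs)
```

With one-hot labels only the log-probability of the true class contributes to each sample's loss, so a confident correct prediction drives the loss towards zero.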
\u003cp\u003eModel parameters were saved and loaded from the disk, enhancing reproducibility, and allowing for further utility. For training, an iterative tokenization procedure was applied to the training and validation datasets across a sequence of times, aligning with the sequential nature of the data. The combined model was fit for 2000 epochs with a batch size of 10, balancing the trade-off between computational efficiency and convergence stability. Following training, the model underwent evaluation on a test dataset, and various functionalities were deployed, including saving, loading best models, and executing predictions with the optimally performing model. Additionally, a series of utility functions were employed to perform crucial tasks such as validation of SMILES strings, generation, and canonical conversion of specific SMILES strings, pairwise similarity computation, prediction using generated SMILES strings, simple moving average calculation, reward calculation, and similarity and canonical checks on generated strings. These functions not only enriched the model's interpretive capability but also facilitated a more nuanced assessment and interpretation of predictions concerning biodegradable and non-biodegradable chemical structures. Collectively, the integrated methodology provided a robust framework for predictive analysis, merging sequence understanding with spatial pattern recognition and supporting comprehensive validation and interpretive analysis.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3 Generator model building\u003c/h2\u003e \u003cp\u003eDeveloping novel biodegradable materials is crucial in modern materials science, contributing to sustainable development and environmental protection. In this research, a methodology is constructed leveraging reinforcement learning (RL), uniquely suited to this task due to its ability to explore and optimize complex, high-dimensional spaces. 
The RL model consists of three primary components: the generator, predictor, and reward function, each with a distinct role. (1) \u003cb\u003eGenerator\u003c/b\u003e: Utilized for generating molecular trajectories, the generator, adopted from Popova et al.[\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e], is the core of the explorative aspect of the RL framework. It provides the ability to propose new molecular structures in the search space, allowing the discovery of potentially novel biodegradable materials. The generator model is a stack-augmented RNN developed using PyTorch. It consists of an Embedding layer to translate the input \u003cem\u003ex\u003c/em\u003e into continuous space, \u003cem\u003ee(x)\u003c/em\u003e, facilitating the nuanced processing of molecular structures and understanding complex relations within the molecules. A gated recurrent unit (GRU) is employed to handle the sequential SMILES representations and capture temporal dependencies in molecular design. Its reset and update gates are governed by:\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$\\begin{gathered} {r_t}=\\sigma ({W_r} \\cdot [{h_{t - 1}},{x_t}]+{b_r}) \\hfill \\\\ {z_t}=\\sigma ({W_z} \\cdot [{h_{t - 1}},{x_t}]+{b_z}) \\hfill \\\\ \\end{gathered}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eand its hidden state by:\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$${h_t}=(1 - {z_t}) \\odot {h_{t - 1}}+{z_t} \\odot {\\widetilde {h}_t}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eσ\u003c/em\u003e is the sigmoid function, 
\u003cem\u003er\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ez\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e denote the reset and update gates at time t, \u003cem\u003eh\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u0026minus;1\u003c/em\u003e\u003c/sub\u003e is the previous hidden state, \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e is the current input, \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\${\\widetilde{h}}_{t}\$\u003c/span\u003e\u003c/span\u003e is the candidate hidden state, and \u003cem\u003eW\u003c/em\u003e\u003csub\u003e\u003cem\u003er\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eW\u003c/em\u003e\u003csub\u003e\u003cem\u003ez\u003c/em\u003e\u003c/sub\u003e, \u003cem\u003eb\u003c/em\u003e\u003csub\u003e\u003cem\u003er\u003c/em\u003e\u003c/sub\u003e, and \u003cem\u003eb\u003c/em\u003e\u003csub\u003e\u003cem\u003ez\u003c/em\u003e\u003c/sub\u003e are the weight matrices and bias terms.\u003c/p\u003e \u003cp\u003eAn innovative feature of this model is the stack augmentation mechanism, which is central to generating diverse and complex molecular trajectories. The stack operation equations, governed by push, pop, and no-op controls, enable flexible and intelligent manipulation of the stack structure. 
The decoder, coupled with LogSoftmax activation, translates the GRU's output and ensures normalization, fundamental for accurate prediction and selection of the next molecular character.\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$${y_t}=LogSoftmax({W_o} \\cdot {h_t}+{b_{o}})$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e10\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eW\u003c/em\u003e\u003csub\u003e\u003cem\u003eo\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003eb\u003c/em\u003e\u003csub\u003e\u003cem\u003eo\u003c/em\u003e\u003c/sub\u003e are the weight matrix and bias term of the output layer, and \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003et\u003c/em\u003e\u003c/sub\u003e is the predicted output at time t.\u003c/p\u003e \u003cp\u003eThe training and evaluation functions encapsulate the learning process, which is essential for adapting the model to generate desired molecular structures. The loss is computed using the Cross-Entropy Loss function. Additionally, various utilities, including changing the learning rate and handling stack operations, enhance the flexibility and efficiency of the model. (2) \u003cb\u003ePredictor\u003c/b\u003e: Previously described in Section 2.2, this component evaluates the generated trajectories, functioning as the evaluative mechanism within the RL environment. It serves as the scientific bridge between the mathematical formulations of RL and the physical properties of molecules, providing tangible feedback based on generated molecular structures. (3) \u003cb\u003eReward Function\u003c/b\u003e: Computing the reward based on the generated sequence of molecular structures, the reward function plays a critical role in guiding the learning process. 
By quantifying the value of each structure in terms of biodegradability, it ensures that the learning process aligns with the ultimate scientific goal of the research. Herein, a high reward is assigned if the generated material is not in the training data and is biodegradable. This allows the weights of the generator model to be updated accordingly. The reward is expressed as \u003cem\u003eR(s, a)\u003c/em\u003e, where \u003cem\u003es\u003c/em\u003e denotes the state, and \u003cem\u003ea\u003c/em\u003e denotes the action taken.\u003c/p\u003e \u003cp\u003eThe policy gradient method, which is well suited to the continuous, high-dimensional action spaces common in molecular design, is applied. This method maximizes the expected cumulative reward [\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e, \u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e, \u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e], emphasizing the trajectories that lead to the most promising materials, according to the following equation:\u003cdiv id=\"Equ10\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ10\" name=\"EquationSource\"\u003e\n$${\\nabla _\\theta }J\\left( \\theta \\right)={{\\rm E}_{{\\pi _\\theta }}}\\left[ {{\\nabla _\\theta }\\log {\\pi _\\theta }(a\\mid s){Q^\\pi }(s,a)} \\right]$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e11\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eθ\u003c/em\u003e denotes the policy parameters, \u003cem\u003eπ\u003c/em\u003e\u003csub\u003e\u003cem\u003eθ\u003c/em\u003e\u003c/sub\u003e represents the policy, and \u003cem\u003eQ\u003c/em\u003e\u003csup\u003e\u003cem\u003eπ\u003c/em\u003e\u003c/sup\u003e is the action-value function.\u003c/p\u003e \u003cp\u003eGradient clipping promotes stable and robust convergence by mitigating the exploding gradient problem. 
The clipped gradient [\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e] can be represented as:\u003cdiv id=\"Equ11\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ11\" name=\"EquationSource\"\u003e\n$${\\nabla _{clipped}}=min\\left( {\\nabla ,\\frac{\\nabla }{{\\parallel \\nabla \\parallel }} \\times threshold} \\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e12\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eThe iterative process, involving policy replay and updates, illustrates RL's dynamism. The update rule can be expressed using the Bellman equation[\u003cspan additionalcitationids=\"CR35\" citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR36\" class=\"CitationRef\"\u003e36\u003c/span\u003e]:\u003cdiv id=\"Equ12\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ12\" name=\"EquationSource\"\u003e\n$$Q\\left( {s,a} \\right) \\leftarrow \\left( {1 - \\alpha } \\right)Q\\left( {s,a} \\right)+\\alpha \\left( {r+\\gamma {\\max _{a\\prime }}Q\\left( {s\\prime ,a\\prime } \\right)} \\right)$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e13\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eα\u003c/em\u003e is the learning rate, \u003cem\u003eγ\u003c/em\u003e is the discount factor, \u003cem\u003er\u003c/em\u003e is the immediate reward, and \u003cem\u003es\u0026prime;\u003c/em\u003e and \u003cem\u003ea\u0026prime;\u003c/em\u003e are the next state and action.\u003c/p\u003e \u003cp\u003eFurthermore, evaluating the generated SMILES strings' validity and canonicity ensures that the generated molecular structures are not only novel but also chemically accurate and practically feasible. Lastly, converting valid SMILES into canonical form, together with the subsequent visualization, connects the computational results to real-world chemical representations. 
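Returning to the gradient clipping step of Eq. (12), one common reading is norm-based clipping: the gradient is left unchanged when its norm is within the threshold and is otherwise rescaled to have exactly that norm. The sketch below assumes this interpretation; the function name `clip_by_norm`, the example gradient, and the threshold value are hypothetical, and the source does not state which variant or threshold was used.

```python
import math


def clip_by_norm(grad, threshold):
    """Rescale `grad` so that its Euclidean norm does not exceed `threshold`."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)  # already within bounds (also covers a zero gradient)
    scale = threshold / norm
    return [g * scale for g in grad]


# Hypothetical gradient with norm 5, clipped to norm 1: approximately [0.6, 0.8].
clipped = clip_by_norm([3.0, 4.0], threshold=1.0)
```

Rescaling preserves the direction of the update and only limits its magnitude, which is why this form is often preferred over clipping each component separately.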
The overall schematic representation of the proposed solution strategy is presented in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4 Hyperparameter optimization\u003c/h2\u003e \u003cp\u003eThe grid search method[\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e] was employed to enhance the accuracy of the models, reducing the risk of the suboptimal configurations that the conventional trial-and-error approach to model fine-tuning often yields [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e, \u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. The grid search methodology represents a fundamental algorithmic approach for hyperparameter tuning[\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e]. In essence, we partition the domain of the hyperparameters into a discretized grid. Next, we systematically explore all possible combinations of values within this grid while evaluating performance metrics through cross-validation. The grid point that yields the highest average value during cross-validation represents the optimal configuration of hyperparameters. Because grid search exhaustively explores all combinations, it identifies the best point within the given grid[\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e]. Its main limitation is computational cost: an exhaustive exploration of all grid configurations requires a substantial amount of time. 
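The exhaustive enumeration described above can be sketched in a few lines. This is a minimal illustration, not the authors' optimization code: the grid values, the parameter names, and `toy_score` (a stand-in for a cross-validated metric, which in practice would require training a model at every grid point) are hypothetical.

```python
from itertools import product


def grid_search(grid, score_fn):
    """Evaluate every combination in `grid` and return the best-scoring one."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    n_evaluated = 0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        n_evaluated += 1
        if score > best_score:  # keep the first combination with the top score
            best_params, best_score = params, score
    return best_params, best_score, n_evaluated


# Hypothetical grid; the candidate values are illustrative only.
grid = {
    "learning_rate": [0.1, 0.01, 0.001],
    "units": [16, 32, 64],
    "activation": ["tanh", "relu"],
}


def toy_score(params):
    """Stand-in for a cross-validated metric (an assumption for illustration)."""
    return -abs(params["learning_rate"] - 0.01) - abs(params["units"] - 32) / 100


best_params, best_score, n_evaluated = grid_search(grid, toy_score)
```

The number of evaluations is the product of the candidate counts per hyperparameter (3 × 3 × 2 = 18 here), which is why the cost of grid search grows quickly as hyperparameters or candidate values are added.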
Moreover, each point within the grid requires k-fold cross-validation, a process that entails k training iterations[\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e]. Thus, optimizing the hyperparameters of a model using this methodology can be complex and costly. Nevertheless, grid search is well suited to exploring the combined effects of hyperparameters in pursuit of optimal performance. The range of hyperparameters explored is presented in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eParameters used in the hyperparameter optimization.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"2\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eType\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eRange or candidates\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNumber of epochs\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u0026ndash;10000\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNumber of neurons in the first layer\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u0026ndash;500\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eNumber of neurons in the second layer\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u0026ndash;100\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eActivation functions\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSigmoid, Hyperbolic tangent function, ReLU, leaky ReLU, ELU, and SELU\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLearning rates\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0.1, 0.05, 0.04, 0.03, 0.02, 0.01, 0.007, 0.005, 0.003, and 0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLoss functions\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eMSE and MAE\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOptimizer\u003csup\u003e\u0026dagger;\u003c/sup\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAdaGrad, Adam, AdaMax, and Nadam\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003ctfoot\u003e \u003ctr\u003e\u003ctd colspan=\"2\"\u003e\u003csup\u003e\u0026dagger;\u003c/sup\u003e [\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e, \u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e], 
\u003csup\u003e\u0026Dagger;\u003c/sup\u003e[\u003cspan additionalcitationids=\"CR44\" citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e]\u003c/td\u003e\u003c/tr\u003e \u003c/tfoot\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"3. Results and Discussion","content":"\u003cp\u003eThis section presents and discusses the findings from our proposed model and its comparison with previous models. Testing of the model on the electrolyte dataset to establish its generalizability and novel material discovery potential is also presented.\u003c/p\u003e\n\u003cdiv id=\"Sec8\" class=\"Section2\"\u003e\n \u003ch2\u003e3.1. Model validation\u003c/h2\u003e\n \u003cp\u003eFig. \u003cspan class=\"InternalRef\"\u003e3\u003c/span\u003e shows the prediction capability of our integrated model on the training and validation datasets and compares it with the CNN- and LSTM-only models. It is worth noting that these models were optimized using the grid search method presented in Section 2.4. The proposed model exhibited an accuracy of 87.2%, compared to 75.4% and 79.3% for the CNN- and LSTM-only models, respectively. This is ascribed to the proposed model\u0026rsquo;s ability to learn both spatial and temporal dependencies in the SMILES data, enhancing its capability to efficiently predict the biodegradability of the organic materials. The CNN-only model overfits significantly, as revealed by its high training AUC of 0.981 against a much lower test AUC of 0.834. The LSTM-only model, on the other hand, outperformed the CNN-only model in terms of generalizability, achieving a test AUC of 0.892, but underperformed the integrated model. The accuracy of the proposed integrated model can be further improved by increasing the number of molecules in the dataset, which is only 1055 in this case. 
\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\n \u003ch2\u003e3.2. Novel Material Discovery\u003c/h2\u003e\n \u003cp\u003eThe gated recurrent unit (GRU), which has a similar structure to the LSTM, enables faster SMILES generation [\u003cspan class=\"CitationRef\"\u003e46\u003c/span\u003e]. The GRU predicts the next character based on the current input character and hidden state, which allows new compounds or materials to be generated character by character. To train the GRU model, \u0026apos;\u0026lt;\u0026apos; is added at the beginning and \u0026apos;\u0026gt;\u0026apos; at the end of each SMILES sequence. This modification ensures that the model learns to recognize the start and end points of the SMILES sequences during training. For example: \u0026lt; [O-][N+](c1c(Cl)ccc([N+]([O-])=O)c1)=O \u0026gt;.\u003c/p\u003e\n \u003cp\u003eFrom our results, approximately 60% of the generated SMILES were valid, indicating that most of them adhere to the structural rules governing the SMILES notation. In addition, more than 80% of these generated SMILES were distinct from the compounds in the training dataset, highlighting their novel characteristics and exploratory potential. Examples of the generated materials are presented in Fig. 4. Nevertheless, only about 40% of the generated SMILES were biodegradable (Fig. 5a). Hence, the biodegradable material generation ability of the model must be enhanced.\u003c/p\u003e\n \u003cp\u003eBy leveraging the developed biodegradability prediction model and the GRU-based SMILES generator, we can use RL to explore and discover novel organic compounds with inherent biodegradable properties. Here, a high reward is assigned if the generated material is not in the training data and is biodegradable. 
This allows the weights of the generator model to be updated. Therefore, we next present the results of the final generator model integrated with RL. The generative model was successfully trained through RL to discover more biodegradable compounds. In contrast to the GRU-based generative model without RL, about 95% of the materials generated by the final model were biodegradable (Fig. 5a and b), of which 42% are not present in the original training dataset, demonstrating the novel material discovery capability of our model. By incorporating constraints on the similarity between specific functional groups/atoms and the generated compounds, we could generate diverse materials while preserving specific functional group/atom characteristics (Fig. 5c).\u003c/p\u003e\n \u003cp\u003eNext, we compared our model with the state-of-the-art model proposed by Popova et al. [\u003cspan class=\"CitationRef\"\u003e31\u003c/span\u003e] for de novo drug design and obtained superior prediction results in terms of ROC, RMSE, and MAPE. Notably, our model outperformed the state-of-the-art model, with a training AUC of 0.974 and a testing AUC of 0.916 compared to 0.913 and 0.891, respectively. This result is ascribed to the ability of the integrated CNN-LSTM model to learn both spatial and temporal dependencies in the training dataset. It is worth mentioning that the CNN component of the proposed model tends to overfit (a wider gap between the training and test results) and was therefore carefully optimized to yield the expected synergistic effect. 
In terms of computational time, both models required nearly the same training time of about 2 hours and 20 minutes, indicating that the proposed model achieves its better accuracy without additional computational burden.\u003c/p\u003e\n \u003cp\u003eIn addition, we tested the capability of the developed model in designing novel electrolytes as part of this research, recognizing their significance in today\u0026apos;s world. Electrolytes are fundamental components in numerous applications, spanning from energy storage systems like batteries and supercapacitors to vital functions in biological systems. We focused on developing an electrolyte with specific properties such as low viscosity, high conductivity, and cost-effectiveness. To achieve this, we explored the molecular weights of the electrolytes (leveraging large PubChem data [\u003cspan class=\"CitationRef\"\u003e47\u003c/span\u003e\u0026ndash;\u003cspan class=\"CitationRef\"\u003e49\u003c/span\u003e]), which have a direct correlation with viscosity, a primary determinant of electrolyte performance. Electrolytes are particularly instrumental in regulating ion transport within various electrochemical devices, making their viscosity a key factor in overall efficiency.\u003c/p\u003e\n \u003cp\u003eFor the prediction model, we utilized a normalized molecular weight distribution with a mean of 60 g/mol and a deviation of 0.7. This distribution allows us to assess the molecular weight of the generated chemical SMILES. If a molecule is either too light or too heavy, the probability assigned to it under the assumed distribution is very low. We incorporated this probability as a reward and penalty mechanism within the RL framework (Fig. \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003ea). 
By incorporating these methodologies, we could develop a novel electrolyte that meets the desired criteria of low viscosity, high conductivity, and cost-effectiveness, contributing to the advancement of organic materials and their applications.\u003c/p\u003e\n \u003cp\u003eAs shown in Fig. \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003eb, the desired lithium-based electrolytes with the expected molecular weight were generated using the model presented in Fig. \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003ea. The distribution in Fig. \u003cspan class=\"InternalRef\"\u003e7\u003c/span\u003ec indicates that after incorporating RL, we could achieve a much wider molecular weight distribution. Herein, compounds with a specific or desired molecular weight could be generated by feeding such information to the RL framework, improving design-of-experiment approaches and facilitating material discovery.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"4. Conclusion","content":"\u003cp\u003eThis study presents a comprehensive analysis of our proposed integrated model for biodegradability prediction and novel material discovery. The model's predictive capabilities were validated, demonstrating superior performance compared to CNN- and LSTM-only models. The integrated model achieved an accuracy of 87.2%, showcasing its ability to learn spatial and temporal dependencies in SMILES data. Our novel material discovery approach, utilizing a GRU-based SMILES generator within a reinforcement learning framework, showed significant potential. Around 60% of the generated SMILES were valid, and over 80% were distinct from the training dataset, indicating their novelty. Moreover, through RL, we enhanced the model's ability to generate biodegradable materials, with approximately 95% being biodegradable, including 42% not present in the original training dataset. 
Furthermore, we compared our model to a state-of-the-art model proposed for de novo drug design and achieved superior results in terms of ROC, highlighting the model's potential in diverse applications. Expanding the scope of our research to the design of novel electrolytes by employing large-scale molecular data, we developed a novel electrolyte with specific properties like low viscosity, high conductivity, and cost-effectiveness, contributing to the advancement of organic materials and their applications. Our integrated model has shown exceptional promise in biodegradability prediction, material discovery, and electrolyte design. Future work could further enhance the model's capabilities and explore its applications in various material discovery fields. This research represents a significant step towards leveraging artificial intelligence for material discovery and design in today's dynamic scientific landscape.\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAcknowledgment\u003c/h2\u003e \u003cp\u003eThis research was supported by INHA UNIVERSITY Research\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eF. Wu, M. Misra, A.K. Mohanty, Challenges and new opportunities on barrier performance of biodegradable polymers for sustainable packaging, Prog Polym Sci. 117 (2021) 101395. https://doi.org/10.1016/j.progpolymsci.2021.101395.\u003c/li\u003e\n \u003cli\u003eR. Grace, Closing the Circle: Reshaping How Products are Conceived \u0026amp; Made, Plastics Engineering. 73 (2017) 8\u0026ndash;11. https://doi.org/10.1002/j.1941-9635.2017.tb01670.x.\u003c/li\u003e\n \u003cli\u003eF. Allen, J. Gasparro, J. Swaney, M. Phelan, J. Gillespie, Directive 2004/38/EC of the European Parliament and of the Council of 29 April 2004, Immigration Law Handbook. (2023) 2253-C79P212. https://doi.org/10.1093/oso/9780192896292.003.0079.\u003c/li\u003e\n \u003cli\u003eTest No. 301: Ready Biodegradability, OECD, 1992. 
https://doi.org/10.1787/9789264070349-en.\u003c/li\u003e\n \u003cli\u003eIdentification of biodegradation models under model and data uncertainty, Water Science and Technology. 33 (1996). https://doi.org/10.1016/0273-1223(96)00192-8.\u003c/li\u003e\n \u003cli\u003eP.G. Polishchuk, T.I. Madzhidov, A. Varnek, Estimation of the size of drug-like chemical space based on GDB-17 data, J Comput Aided Mol Des. 27 (2013) 675\u0026ndash;679. https://doi.org/10.1007/s10822-013-9672-4.\u003c/li\u003e\n \u003cli\u003eD. Weininger, SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules, J Chem Inf Comput Sci. 28 (1988) 31\u0026ndash;36. https://doi.org/10.1021/ci00057a005.\u003c/li\u003e\n \u003cli\u003eC. Bilodeau, W. Jin, T. Jaakkola, R. Barzilay, K.F. Jensen, Generative models for molecular discovery: Recent advances and challenges, WIREs Computational Molecular Science. 12 (2022). https://doi.org/10.1002/wcms.1608.\u003c/li\u003e\n \u003cli\u003eM. Olivecrona, T. Blaschke, O. Engkvist, H. Chen, Molecular de-novo design through deep reinforcement learning, J Cheminform. 9 (2017) 48. https://doi.org/10.1186/s13321-017-0235-x.\u003c/li\u003e\n \u003cli\u003eP.-H. Chiu, Y.-L. Yang, H.-K. Tsao, Y.-J. Sheng, Deep learning for predictions of hydrolysis rates and conditional molecular design of esters, J Taiwan Inst Chem Eng. 126 (2021) 1\u0026ndash;13. https://doi.org/10.1016/j.jtice.2021.06.045.\u003c/li\u003e\n \u003cli\u003eM. Wang, C.-Y. Hsieh, J. Wang, D. Wang, G. Weng, C. Shen, X. Yao, Z. Bing, H. Li, D. Cao, T. Hou, RELATION: A Deep Generative Model for Structure-Based De Novo Drug Design, J Med Chem. 65 (2022) 9478\u0026ndash;9492. https://doi.org/10.1021/acs.jmedchem.2c00732.\u003c/li\u003e\n \u003cli\u003eJ. Ar\u0026uacute;s-Pous, A. Patronov, E.J. Bjerrum, C. Tyrchan, J.-L. Reymond, H. Chen, O. Engkvist, SMILES-based deep generative scaffold decorator for de-novo drug design, J Cheminform. 12 (2020) 38. 
https://doi.org/10.1186/s13321-020-00441-8.\u003c/li\u003e\n \u003cli\u003eN. De Cao, T. Kipf, MolGAN: An implicit generative model for small molecular graphs, ArXiv. abs/1805.1 (2018) null. https://www.semanticscholar.org/paper/def1049b5aae96c8e1eab0ca58d77ac9c2f0e3e9.\u003c/li\u003e\n \u003cli\u003eW. Tang, Y. Li, Y. Yu, Z. Wang, T. Xu, J. Chen, J. Lin, X. Li, Development of models predicting biodegradation rate rating with multiple linear regression and support vector machine algorithms, Chemosphere. 253 (2020) 126666. https://doi.org/10.1016/j.chemosphere.2020.126666.\u003c/li\u003e\n \u003cli\u003eO. Dollar, N. Joshi, D.A.C. Beck, J. Pfaendtner, Attention-based generative models for de novo molecular design, Chem Sci. 12 (2021) 8362\u0026ndash;8372. https://doi.org/10.1039/d1sc01050f.\u003c/li\u003e\n \u003cli\u003eF. Lunghini, G. Marcou, P. Gantzer, P. Azam, D. Horvath, E. Van Miert, A. Varnek, Modelling of ready biodegradability based on combined public and industrial data sources, SAR QSAR Environ Res. 31 (2019) 171\u0026ndash;186. https://doi.org/10.1080/1062936x.2019.1697360.\u003c/li\u003e\n \u003cli\u003eW.F.C. Rocha, D.A. Sheen, Classification of biodegradable materials using QSAR modelling with uncertainty estimation, SAR QSAR Environ Res. 27 (2016) 799\u0026ndash;811. https://doi.org/10.1080/1062936X.2016.1238010.\u003c/li\u003e\n \u003cli\u003eK. Acharya, D. Werner, J. Dolfing, M. Barycki, P. Meynet, W. Mrozik, O. Komolafe, T. Puzyn, R.J. Davenport, A quantitative structure-biodegradation relationship (QSBR) approach to predict biodegradation rates of aromatic chemicals, Water Res. 157 (2019) 181\u0026ndash;190. https://doi.org/10.1016/j.watres.2019.03.086.\u003c/li\u003e\n \u003cli\u003eR.T.B.D.T.R. Mansouri Kamel, V. Consonni, QSAR biodegradation, (2013).\u003c/li\u003e\n \u003cli\u003eP. Dey, S.K. Chaulya, S. Kumar, Hybrid CNN-LSTM and IoT-based coal mine hazards monitoring and prediction system, Process Safety and Environmental Protection. 
152 (2021) 249\u0026ndash;263. https://doi.org/10.1016/J.PSEP.2021.06.005.\u003c/li\u003e\n \u003cli\u003eY. Zhao, Improvement and Application of Multi-layer LSTM Algorithm Based on Spatial-Temporal Correlation, Ing\u0026eacute;nierie Des Syst\u0026egrave;mes d Inf. 25 (2020) null. https://doi.org/10.18280/isi.250107.\u003c/li\u003e\n \u003cli\u003eC. Ding, G. Wang, X. Zhang, Q. Liu, X. Liu, A hybrid CNN-LSTM model for predicting PM2.5 in Beijing based on spatiotemporal correlation, Environ Ecol Stat. 28 (2021) 503\u0026ndash;522. https://doi.org/10.1007/s10651-021-00501-8.\u003c/li\u003e\n \u003cli\u003eD.Q. Gbadago, J. Moon, M. Kim, S. Hwang, A unified framework for the mathematical modelling, predictive analysis, and optimization of reaction systems using computational fluid dynamics, deep neural network and genetic algorithm: A case of butadiene synthesis, Chemical Engineering Journal. 409 (2021) 128163. https://doi.org/10.1016/j.cej.2020.128163.\u003c/li\u003e\n \u003cli\u003eJ. Moon, D.Q. Gbadago, G. Hwang, D. Lee, S. Hwang, Software platform for high-fidelity-data-based artificial neural network modeling and process optimization in chemical engineering, Comput Chem Eng. 158 (2022) 107637. https://doi.org/10.1016/J.COMPCHEMENG.2021.107637.\u003c/li\u003e\n \u003cli\u003eP. Dey, K. Saurabh, C. Kumar, D. Pandit, S.K. Chaulya, S. Ray, G.M. Prasad, S.K. Mandal, t-SNE and variational auto-encoder with a bi-LSTM neural network-based model for prediction of gas concentration in a sealed-off area of underground coal mines, Soft Comput. 25 (2021) 14183\u0026ndash;14207. https://doi.org/10.1007/s00500-021-06261-8.\u003c/li\u003e\n \u003cli\u003eW. Wang, A Pre-trained Conditional Transformer for Target-speciﬁc De Novo Molecular Generation, (2022). https://www.semanticscholar.org/paper/ed9763062daec0eec7ceb65e822360e340c75605.\u003c/li\u003e\n \u003cli\u003eX. Yang, Z. 
Zhang, An attention-based domain spatial-temporal meta-learning (ADST-ML) approach for PM2.5 concentration dynamics prediction, Urban Clim. null (2023) null. https://doi.org/10.1016/j.uclim.2022.101363.\u003c/li\u003e\n \u003cli\u003eN. Xu, X. Wang, X. Meng, H. Chang, Gas Concentration Prediction Based on IWOA-LSTM-CEEMDAN Residual Correction Model, Sensors (Basel). 22 (2022) null. https://doi.org/10.3390/s22124412.\u003c/li\u003e\n \u003cli\u003eL. Pingyang, N. Chen, M. Shanjun, L. Mei, LSTM based encoder-decoder for short-term predictions of gas concentration using multi-sensor fusion, Process Safety and Environmental Protection. 137 (2020) 93\u0026ndash;105. https://doi.org/10.1016/j.psep.2020.02.021.\u003c/li\u003e\n \u003cli\u003eK. Kumari, P. Dey, C. Kumar, D. Pandit, S. Mishra, V. Kisku, S.K. Chaulya, S. Ray, G.M. Prasad, UMAP and LSTM based fire status and explosibility prediction for sealed-off area in underground coal mine, Process Safety and Environmental Protection. 146 (2021) 837\u0026ndash;852. https://doi.org/10.1016/j.psep.2020.12.019.\u003c/li\u003e\n \u003cli\u003eM. Popova, O. Isayev, A. Tropsha, Deep reinforcement learning for de novo drug design, Sci Adv. 4 (2018) eaap7885\u0026ndash;eaap7885. https://doi.org/10.1126/sciadv.aap7885.\u003c/li\u003e\n \u003cli\u003eM. Popova, M. Shvets, J.B. Oliva, O. Isayev, MolecularRNN: Generating realistic molecular graphs with optimized properties, ArXiv. abs/1905.1 (2019) null. https://www.semanticscholar.org/paper/3ccd291c8848c73ca34152e27c3ec296cfc838d0.\u003c/li\u003e\n \u003cli\u003eZ. Zhou, S. Kearnes, L. Li, R. Zare, P.F. Riley, Optimization of Molecules via Deep Reinforcement Learning, Sci Rep. 9 (2018) null. https://doi.org/10.1038/s41598-019-47148-x.\u003c/li\u003e\n \u003cli\u003eBellman-consistent Pessimism for Offline Reinforcement Learning | OpenReview, (n.d.). https://openreview.net/forum?id=e8WWUBeafM (accessed October 10, 2023).\u003c/li\u003e\n \u003cli\u003eB. O\u0026rsquo;donoghue, I. 
Osband, R. Munos, V. Mnih, The Uncertainty Bellman Equation and Exploration, (2018).\u003c/li\u003e\n \u003cli\u003eY. Fei, Z. Yang, Y. Chen, Z. Wang, Exponential Bellman Equation and Improved Regret Bounds for Risk-Sensitive Reinforcement Learning, (n.d.).\u003c/li\u003e\n \u003cli\u003eH.A. Fayed, A.F. Atiya, Speed up grid-search for parameter selection of support vector machines, Appl Soft Comput. 80 (2019) 202\u0026ndash;210. https://doi.org/10.1016/J.ASOC.2019.03.037.\u003c/li\u003e\n \u003cli\u003eS.M. LaValle, M.S. Branicky, S.R. Lindemann, On the Relationship between Classical Grid Search and Probabilistic Roadmaps, Http://Dx.Doi.Org/10.1177/0278364904045481. 23 (2004) 673\u0026ndash;692. https://doi.org/10.1177/0278364904045481.\u003c/li\u003e\n \u003cli\u003eP. Liashchynskyi, P. Liashchynskyi, Grid Search, Random Search, Genetic Algorithm: A Big Comparison for NAS, (2019). https://arxiv.org/abs/1912.06059v1 (accessed October 11, 2023).\u003c/li\u003e\n \u003cli\u003eF.J. Pontes, G.F. Amorim, P.P. Balestrassi, A.P. Paiva, J.R. Ferreira, Design of experiments and focused grid search for neural network parameter optimization, Neurocomputing. 186 (2016) 22\u0026ndash;34. https://doi.org/10.1016/J.NEUCOM.2015.12.061.\u003c/li\u003e\n \u003cli\u003eR.Y. Acharya, N.F. Charlot, M.M. Alam, F. Ganji, D. Gauthier, D. Forte, Chaogate parameter optimization using bayesian optimization and genetic algorithm, Proceedings - International Symposium on Quality Electronic Design, ISQED. 2021-April (2021) 426\u0026ndash;431. https://doi.org/10.1109/ISQED51717.2021.9424355.\u003c/li\u003e\n \u003cli\u003eH. Alibrahim, S.A. Ludwig, Hyperparameter Optimization: Comparing Genetic Algorithm against Grid Search and Bayesian Optimization, IEEE Congress on Evolutionary Computation (CEC). (2021) 1551\u0026ndash;1559. https://doi.org/10.1109/cec45853.2021.9504761.\u003c/li\u003e\n \u003cli\u003eY. Shin, Z. Kim, J. Yu, G. Kim, S. 
Hwang, Development of NOx reduction system utilizing artificial neural network (ANN) and genetic algorithm (GA), J Clean Prod. 232 (2019) 1418\u0026ndash;1429. https://doi.org/10.1016/j.jclepro.2019.05.276.\u003c/li\u003e\n \u003cli\u003eD.Q. Gbadago, J. Moon, M. Kim, S. Hwang, A unified framework for the mathematical modelling, predictive analysis, and optimization of reaction systems using computational fluid dynamics, deep neural network and genetic algorithm: A case of butadiene synthesis, Chemical Engineering Journal. 409 (2021) 128163. https://doi.org/10.1016/j.cej.2020.128163.\u003c/li\u003e\n \u003cli\u003eF. Mohammadi, M.R. Samaei, A. Azhdarpoor, H. Teiri, A. Badeenezhad, S. Rostami, Modelling and Optimizing Pyrene Removal from the Soil by Phytoremediation using Response Surface Methodology, Artificial Neural Networks, and Genetic Algorithm, Chemosphere. 237 (2019) 124486. https://doi.org/10.1016/j.chemosphere.2019.124486.\u003c/li\u003e\n \u003cli\u003eB. Athiwaratkun, J.W. Stokes, Malware classification with LSTM and GRU language models and a character-level CNN, ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings. (2017) 2482\u0026ndash;2486. https://doi.org/10.1109/ICASSP.2017.7952603.\u003c/li\u003e\n \u003cli\u003eS. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B.A. Shoemaker, P.A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E.E. Bolton, PubChem 2023 update, Nucleic Acids Res. 51 (2023) D1373\u0026ndash;D1380. https://doi.org/10.1093/NAR/GKAC956.\u003c/li\u003e\n \u003cli\u003eV.D. H\u0026auml;hnke, S. Kim, E.E. Bolton, PubChem chemical structure standardization, J Cheminform. 10 (2018). https://doi.org/10.1186/S13321-018-0293-8.\u003c/li\u003e\n \u003cli\u003eS. Kim, J. Chen, T. Cheng, A. Gindulyte, J. He, S. He, Q. Li, B.A. Shoemaker, P.A. Thiessen, B. Yu, L. Zaslavsky, J. Zhang, E.E. Bolton, PubChem 2019 update: improved access to chemical data, Nucleic Acids Res. 
47 (2019) D1102\u0026ndash;D1109. https://doi.org/10.1093/NAR/GKY1033.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":true,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"korean-journal-of-chemical-engineering","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"kjce","sideBox":"Learn more about [Korean Journal of Chemical Engineering](http://link.springer.com/journal/11814)","snPcode":"11814","submissionUrl":"https://www.editorialmanager.com/kjce/default2.aspx","title":"Korean Journal of Chemical Engineering","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Subscription","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Biodegradability, SMILES, Green chemistry","lastPublishedDoi":"10.21203/rs.3.rs-4002218/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4002218/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThe increasing global demand for eco-friendly products is driving innovation in sustainable chemical synthesis, particularly the development of biodegradable substances. Herein, a novel method utilizing artificial intelligence (AI) to predict the biodegradability of organic compounds is presented, overcoming the limitations of traditional prediction methods that rely on laborious and costly density functional theory (DFT) calculations. 
We propose leveraging readily available molecular formulas and structures represented by simplified molecular-input line-entry system (SMILES) notation and molecular images to develop an effective AI-based prediction model using state-of-the-art machine learning techniques, including deep convolutional neural networks (CNN) and long-short term memory (LSTM) learning algorithms, capable of extracting meaningful molecular features and spatiotemporal relationships. The model is further enhanced with reinforcement learning (RL) to better predict and discover new biodegradable materials by rewarding the system for identifying unique and biodegradable compounds. The combined CNN-LSTM model achieved an 87.2% prediction accuracy, outperforming CNN- (75.4%) and LSTM-only (79.3%) models. The RL-assisted generator model produced approximately 60% valid SMILES structures, with over 80% being unique to the training dataset, demonstrating the model's capability to generate novel compounds with potential for practical application in sustainable chemistry. 
The model was extended to develop novel electrolytes with desired molecular weight distribution.\u003c/p\u003e","manuscriptTitle":"Deep Learning for Green Chemistry: An AI-Enabled Pathway for Biodegradability Prediction and Organic Material Discovery","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-03-18 17:47:27","doi":"10.21203/rs.3.rs-4002218/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Major Revisions Needed","date":"2024-04-07T21:34:47+00:00","index":"","fulltext":""},{"type":"reviewerAgreed","content":"","date":"2024-03-15T06:29:41+00:00","index":0,"fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-03-14T01:24:16+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-03-04T13:58:05+00:00","index":"","fulltext":""},{"type":"submitted","content":"Korean Journal of Chemical Engineering","date":"2024-02-29T01:49:56+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"korean-journal-of-chemical-engineering","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"kjce","sideBox":"Learn more about [Korean Journal of Chemical Engineering](http://link.springer.com/journal/11814)","snPcode":"11814","submissionUrl":"https://www.editorialmanager.com/kjce/default2.aspx","title":"Korean Journal of Chemical Engineering","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Subscription","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"287a9890-b3a2-44ee-b960-014e6d38de26","owner":[],"postedDate":"March 18th, 
2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-06-21T15:12:10+00:00","versionOfRecord":{"articleIdentity":"rs-4002218","link":"https://doi.org/10.1007/s11814-024-00202-5","journal":{"identity":"korean-journal-of-chemical-engineering","isVorOnly":false,"title":"Korean Journal of Chemical Engineering"},"publishedOn":"2024-06-12 15:12:10","publishedOnDateReadable":"June 12th, 2024"},"versionCreatedAt":"2024-03-18 17:47:27","video":"","vorDoi":"10.1007/s11814-024-00202-5","vorDoiUrl":"https://doi.org/10.1007/s11814-024-00202-5","workflowStages":[]},"version":"v1","identity":"rs-4002218","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4002218","identity":"rs-4002218","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
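
The two reward signals described in Section 3.2 (a high reward for generated molecules that are both novel and biodegradable, and a molecular-weight term based on an assumed distribution with a mean of 60 g/mol and a deviation of 0.7) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `mw_reward` and `novel_biodegradable_reward`, the callback `predict_biodegradable`, the Gaussian shape, the unnormalized (0, 1] scaling, and the reading of 0.7 as a standard deviation in the same units as the weight are all assumptions made for illustration; the paper does not specify these details.

```python
import math


def mw_reward(mol_weight, mean=60.0, std=0.7):
    """Gaussian-shaped score in (0, 1] for a molecular weight.

    Assumption: the paper's "normalized distribution" is treated here as a
    Gaussian kernel scaled so that the target weight scores exactly 1.
    Weights far below or above the mean score close to 0 (a penalty).
    """
    z = (mol_weight - mean) / std
    return math.exp(-0.5 * z * z)


def novel_biodegradable_reward(smiles, training_set, predict_biodegradable,
                               high=1.0, low=0.0):
    """Return `high` only if the molecule is novel AND predicted biodegradable.

    `predict_biodegradable` is a hypothetical stand-in for the trained
    CNN-LSTM classifier; `high` and `low` are illustrative reward values.
    """
    is_novel = smiles not in training_set
    return high if (is_novel and predict_biodegradable(smiles)) else low


if __name__ == "__main__":
    seen = {"CCO"}
    always_yes = lambda s: True  # placeholder classifier for the demo
    print(novel_biodegradable_reward("CCO", seen, always_yes))   # in training set
    print(novel_biodegradable_reward("CCOC", seen, always_yes))  # novel
    print(mw_reward(60.0), mw_reward(75.0))
```

In an RL loop, such scores would be computed for each sampled SMILES string and fed back to update the generator; how the two terms are combined, and whether invalid SMILES receive an explicit penalty, is left open here because the source does not state it.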


License: CC-BY-4.0