A transferable fragment dictionary for eco-safety and environmental risk assessment

Research Article

Chen-Chen Zhao, Shaoyi Hou, Cheng Fu, Peng Peng, Guoqiang Wang, and 3 more

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-7617003/v1. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Developing environmentally safe and sustainable chemicals is essential for maintaining ecological integrity. Here we introduce a new molecular representation, the atomic charge enhanced Fragment Dictionary (eFragD), to quantify fragment-level contributions to molecular biological effects, including acute toxicity and cyanocidal activity.
Unlike conventional fragment-based approaches, eFragD incorporates density functional theory (DFT)-derived atomic charge information to enhance structural sensitivity, improving predictive specificity and interpretability. Applied to a dataset of 7,804 compounds, the framework identified 19 high-risk and 18 eco-compatible fragments, which were validated on 1,400 external chemicals, including antibiotics, new compounds and organic ligands for hybrid perovskite materials. This approach enables early-stage chemical screening for both biological safety and ecological compatibility, supporting sustainable materials design. Notably, in data-scarce scenarios, eFragD effectively prioritized low-toxicity cyanocides, demonstrating its utility in balancing environmental efficacy with long-term sustainability. This work bridges computational toxicology and sustainability science, providing a scalable framework for green chemistry, water quality protection, and evidence-based chemical regulation.

Keywords: Toxicology; Artificial Intelligence and Machine Learning; Computational Chemistry; Fragment-Based Machine Learning; DFT-Derived Descriptors; Interdisciplinary risk assessment

Introduction

The global pursuit of environmental sustainability requires not only the mitigation of existing pollutants but also the proactive development of safe and eco-compatible chemicals [1]. Aquatic ecosystems, which are critical to ensuring freshwater availability and sustaining biodiversity, exhibit heightened vulnerability to chemical disturbances originating from industrial and pharmaceutical sources, particularly through the proliferation of harmful algal blooms (HABs) [2, 3]. While some chemicals mitigate HABs, many exert unintended toxic effects on non-target aquatic species, including ecologically important cyanobacteria.
With tens of thousands of novel compounds being introduced into the environment annually, there is an increasing need to evaluate their safety for human health, for the long-term resilience of ecosystems, and for the sustainability of water systems [4, 5]. Various risk assessment tools, such as the Toxicity Estimation Software Tool (T.E.S.T.) [6] and the Ecological Structure Activity Relationships (ECOSAR) [7], rely on structural analogs or extensive toxicity data. Due to interspecies variability, these approaches often struggle to generalize across organisms. Artificial intelligence (AI)-based methods have gained increasing attention in environmental toxicity modeling [8-11]. Among them, message-passing neural networks (MPNN) [10] and AttentiveFP [11] have demonstrated strong predictive capabilities across various molecular endpoints. However, the limited availability of high-quality datasets for organisms such as cyanobacteria makes these methods less suitable for proactive, sustainability-oriented chemical screening.

Interpretability is another challenge. Understanding which structural features drive biological effects is essential for designing safer, more sustainable chemicals. Tools like GNNExplainer [12], or AttentiveFP used directly, enable visualization of atom-level contributions. However, their outputs often depend on model training and initialization, and may not correspond to chemically meaningful substructures. For molecular prediction models, the most intuitive input is chemical substructure information. Conventional fragmentation approaches, including predefined substructure pattern methods (e.g., structural alerts in ToxAlerts [13] and Bioalerts [14], and the Molecular ACCess System (MACCS) [15]), as well as retrosynthesis-based schemes such as the Retrosynthetic Combinatorial Analysis Procedure (RECAP) [16] and Breaking of Retrosynthetically Interesting Chemical Substructures (BRICS) [17, 18], offer interpretable representations.
However, these methods often fail to capture subtle electronic or contextual differences, making it difficult to distinguish between structurally similar molecules with vastly different biological effects, a phenomenon known as the "activity cliff."

Here, we developed a molecular representation method called the atomic charge enhanced Fragment Dictionary (eFragD). This approach encodes chemically meaningful substructures and integrates atomic charges derived from DFT calculations (Fig. 1). By introducing electrostatic information, eFragD improves both the sensitivity of predictions to subtle structural changes and the interpretability of model outputs. We demonstrate the utility of eFragD in predicting molecular bioactivity for two sustainability-relevant endpoints: rodent acute toxicity and cyanocidal activity. By identifying and leveraging low-toxicity fragments, eFragD enables the discovery of environmentally safer compounds, including potential cyanocides and organic ligands for hybrid perovskite materials. This interpretable and scalable framework provides a step toward integrating green chemistry principles into predictive toxicology and sustainable water management.

Results and Discussion

Enhancing fragment–electrostatic framework for sustainable chemical design

The eFragD approach links chemically meaningful substructures with electronic information to predict both toxicity and environmental safety (see the detailed construction process of eFragD in Methods). The eFragD collection has 155 molecular fragments, encompassing commonly used functional groups, molecular fragments, and patterns formed by multiple fragments (Fig. 2a).
eFragD introduces a hierarchical and chemically intuitive fragmentation strategy by: (1) defining fragments based on functional relevance and chemical intuition; (2) preserving key structural contexts (e.g., differentiating phenolic -OH groups from aliphatic hydroxyl groups); and (3) integrating fragment-level quantum descriptors such as Hirshfeld charges to reflect local electrophilicity and reactivity. This systematic integration retains the interpretability of classical fragment-based models while overcoming their lack of electronic specificity and functional coherence. By jointly encoding structural topology and electrostatic environment, eFragD bridges the gap between qualitative interpretability and mechanistic precision. It enables fine-grained fragment-level attribution, enhances toxicity prediction accuracy, and facilitates the proactive identification of ecologically safer chemical candidates.

To guide fragment selection with improved interpretability, a self-attention mechanism was introduced into a multilevel attention graph convolutional neural network (GCNN) model [19] to identify the substructures most relevant to toxicity prediction (see Supporting Section 1, Figures S1-S2 and Table S1). This attention-based analysis provided a data-driven strategy to highlight chemically meaningful regions (Table S2), which were then incorporated into the construction of the eFragD dictionary. For an input structure (e.g., CAS: 64249-01-0, illustrated in Fig. 2b), the program detected its fragments, including halogen, amide and phosphate-like groups in eFragD. Notably, the halogen and amide groups were identified as being attached to the benzene ring. Charge properties were then incorporated into the identified fragments. Atomic Hirshfeld charges, calculated with Multiwfn [20], are reported to be positively correlated with electrophilicity [21].
Summing the atomic charges gives the total charge of a fragment, which is taken as an indicator of local electrophilicity in a compound. The output table lists the product of the number and the charge value of each fragment type, forming a set of features (N_mn × q_mn) for molecular toxicity prediction and fragment analysis. The toxic fragments are labeled as structural alerts (SAs, red), while the non-toxic fragments are labeled in blue. To verify the robustness of the model, 18 fragments with low acute toxicity were selected for cyanocidal activity prediction, yielding effective and environmentally relevant candidates. This enables ecologically targeted strategies that minimize collateral toxicity, supporting both water quality improvement and primary productivity conservation (Fig. 2c).

eFragD descriptors in toxicity modeling

Leveraging abundant rat oral acute toxicity data, eFragD descriptors, labeled as basic functional groups and potential toxicophores [22], were used to train a model that balances effectiveness and interpretability for toxicity prediction. After preprocessing (Supporting Section 2, Tables S3-S8 and Fig. S3), the 42 most frequently occurring fragments and ten global metrics (F_sp3, M_HA, N_HET, Δq, etc.), assigned as Feature 1 to Feature 52 (listed in Table S9), were ultimately selected as inputs for our ML models. The 42 most frequent fragment features were further analyzed to assess the impact of fragments on toxicity. The top 10 representative fragments associated with non-toxic and with toxic compounds are visualized in Fig. 3a. All fragments were categorized as "Toxic" (including Danger in Category I and Warning in Category II) or "Safe" (Category III), with details provided in Tables S10-S12. In Fig. 3a, the most toxic fragments are phosphate-like groups and oxime functional groups, followed by P=O double bonds.
In contrast, chemicals containing sulfonic acid groups, aldehydes, carboxyl groups, or hydroxyl groups not attached to tertiary carbons (called NoTert-OH) have relatively higher LD50 values, indicating lower toxicity. Since the acute toxicity of a molecule ultimately depends on the whole molecular structure, which contains multiple fragments, it is necessary to combine fragments with ML methods for toxicity assessment.

In the initial stage of molecular toxicity assessment, a multiclass classification into three categories (I-III) was carried out. The Random Forest (RF) classification algorithm (eFragD-RFC) and a weighted-average method were employed to predict toxicity multi-labels across the entire dataset. The Receiver Operating Characteristic (ROC) curves and multiclass confusion matrix are presented in Figs. 3b and S4, respectively. As indicated in Fig. 3b, the eFragD-RFC model achieved a macro-average area under the curve (AUC) of 0.82 in multiclass toxicity prediction. The accuracy of the eFragD-RFC model was 88.0% for the training set and 73.4% for the test set (Fig. S4), indicating that the oral toxicity of most compounds was labeled correctly.

To further quantitatively predict Log(LD50) values for rat oral toxicity, 17 individual regression models were generated using various ML algorithms [23] and eFragD descriptors. The parameters and performances of these models are presented in Supporting Section 3, Table S13 and Fig. S5. As highlighted in Fig. 3c, the RF regression algorithm (eFragD-RFR) outperformed the others, with predicted values matching experimental data. The prediction accuracy is well within acceptable limits for biological experiments. Other molecular descriptors, e.g., RDKit descriptors [24], MACCS and ECFP6 [25], were calculated for comparison with the eFragD descriptors in the RFR model, as shown in Fig. S6.
These results indicate that the eFragD descriptors yield outcomes comparable to those obtained using RDKit descriptors and MACCS fingerprints (with ECFP6 performing relatively less effectively), despite using significantly fewer descriptors.

SHapley Additive exPlanations (SHAP) analysis [26] based on the eFragD-RFR model was used to calculate feature attributions. The top 10 most influential features are visualized in Fig. 3d, with mean contribution values for the entire dataset ranked on the left. A positive contribution indicates that the descriptor increases the predicted Log(LD50) value, while a negative contribution indicates that it decreases the predicted value. The phosphorus-containing groups (P=O double bonds and phosphate-like groups), nitrogen-related groups (-NH, amino and oxime groups) and halogen atoms, which appear as dispersed, high-density SHAP values (red dots) among the negative contributions, tend to lower the model's predicted Log(LD50) values. Notably, higher atomic charge difference (Δq) values, reflecting enhanced molecular electrophilicity and reactivity, correspond to lower Log(LD50) values and thus higher toxicity. Similarly, the total mass of heavy atoms (M_HA) in input molecules shows a negative correlation with molecular acute toxicity values. In contrast, the carboxyl group and the fraction of saturated carbons (F_sp3) enhance hydrophilicity, favoring lower toxicity. The correlations between rat oral acute toxicity and various features, as well as the feature importances assessed by feature_importances_ and the Pearson correlation coefficient (r), exhibit minimal differences (Fig. S7). These fragment-level attributions reproduce known toxicological knowledge while offering interpretable mechanistic insights.
Interestingly, phosphorus fragments were highly enriched in Category I compounds (top three entries of Table S10), raising the question: why are phosphate groups, commonly found in biological molecules like DNA, RNA and ATP, considered highly toxic? Phosphate-like groups, particularly organophosphates, are known as effective components in insecticides that disrupt the nervous system by inhibiting acetylcholinesterase [27, 28]. The toxicity is not due to the phosphate group itself in its natural form, but rather to its specific chemical environment and structure. According to frequency analysis (Tables S14 and S15), isolated phosphate groups are typically classified as non-toxic. However, when oxygen atoms in phosphate are replaced by sulfur atoms to form thiophosphates, the reactivity increases significantly, as shown in Fig. 3e. In particular, the sulfur-phosphorus bond contributes the most to compound toxicity, making it a key factor in toxicity prediction, consistent with literature on the toxicity of organosulfur-phosphorus compounds [29, 30]. Notably, eFragD successfully differentiates between benign phosphate-containing structures and their hazardous analogs. This capacity enables precise toxicity screening and supports the identification of safer, eco-compatible alternatives in molecular design, advancing sustainability-aligned chemical selection.

Substitution analyses confirmed the toxicological influence of specific fragments, as depicted in Fig. 3e. From SHAP analysis, chlorine (Cl) and amino/amide groups were identified as contributors to toxicity, with chlorine associated with increased toxicity (red) and amino/amide groups linked to reduced toxicity (blue). Replacing a Cl atom or hydrogen atom with an amino or amide group decreases the molecular toxicity. Subsequently, several physicochemical descriptors were gathered to provide supplementary insights into the properties influencing molecular toxicity.
These include the number of hydrogen bond donors (N_H-D) and acceptors (N_H-A) (from the RDKit package), the molecular polarity index (MPI) (calculated with Multiwfn software), the molecular polarizability (α) (using Gaussian 16), and the PoLogP model [18, 31, 32]. For comparison, these global descriptors were added to the eFragD descriptors to build an RF regression model. As shown in Fig. S8, the inclusion of descriptors like LogP, α, and MPI did not significantly improve the prediction performance of the RFR model. Our fragment-based approach thus achieves prediction results comparable to conventional descriptors.

External validation was performed using 1,283 chemicals that were not included in the training dataset and that have experimental rat oral LD50 values. The results demonstrated that the prediction accuracy remained robust and showed improved performance compared to the T.E.S.T. software (specific results shown in Fig. S9). Beyond predictive accuracy, eFragD enabled interpretable identification of high-risk fragments, supporting safer-by-design chemical innovation.

To extend eFragD toward sustainable materials development, we applied the framework to predict the acute toxicity of 122 organic ligands proposed for hybrid perovskite materials in recent literature [33] (not included in our training data). Using eFragD descriptors and the pre-trained eFragD-RFR model, the acute oral toxicity (LD50 value) of these ligands was predicted (see Fig. S10). Most candidates exhibited moderate predicted toxicity, while a few were labeled as potentially low- or high-toxicity, highlighting the importance of systematic screening in material design. Furthermore, the fragment-level interpretability of eFragD facilitated the identification of toxicity-associated structural features.
For instance, polarized α,β-unsaturated fragments, such as thiophene, furan, and pyrrole (e.g., CAS 53916-75-9, 118488-08-7, 29709-35-1), were flagged for potential reactivity, reflecting known Michael acceptor behavior. These results demonstrate that eFragD can serve as an interpretable toxicity filter for large chemical libraries, supporting environmentally conscious materials development.

Toxicological effects of coexisting fragments

As chemical structures become more complex, fragment-based analysis offers an effective strategy to reveal underlying patterns and extract interpretable chemical knowledge related to molecular toxicity. However, since biological effects arise from the full molecular context, the impact of individual fragments may shift depending on their neighboring groups and overall structure. This raises the question of whether fragment coexistence results in synergistic effects, like "adding fuel to the fire" (increasing toxicity), or antagonistic effects, similar to "extinguishing the fire" (reducing toxicity) (Fig. 4a). To explore this, the interaction between fragments and the whole molecular structure is examined through case studies (Figs. 4b-4d). The statistical results are color coded to distinguish positive (blue), negative (red), and neutral (gray) associations.

When fragments exhibit synergistic effects, their combination can lead to greater toxicity than a single structural alert (SA). For instance, while sulfonic acid groups and their derivatives are non-toxic on their own, their toxicity rises when combined with other toxic groups such as thiophene, benzo five-membered diazepine-containing heterocyclic (B5MDH) groups and phosphate-like groups, as indicated by the different red dots in Figs. 4b and 4c, where the effect of the B5MDH groups is particularly pronounced. Notably, in our dataset of 144 molecules containing the benzimidazole group (a specific fragment of the B5MDH groups) (Fig. 4d), 127 (approximately 88%) fall into toxicity Category I. Further investigation revealed a clear synergistic effect when benzimidazole coexists with a trifluoromethyl group, namely 2-(trifluoromethyl)benzimidazole (2TFMBI), which has an LD50 of 28 mg/kg, placing it in the Danger category. Among the 128 molecules containing this group in the dataset, the logarithmic toxicity values range from −0.70 to 2.38, all within Category I. Certain fragments, like anhydrides (non-toxic on their own) and pyrazoles, become toxic or exhibit increased toxicity when attached to a benzene ring (Table S16). The attachment increased the negative charge of the fragment (q_A), thereby enhancing its reactivity toward electrophiles. This may lead to covalent bonding with endogenous proteins (such as receptors and enzymes) and nucleic acids (including DNA and RNA), potentially causing adverse outcomes.

In contrast, some fragment pairs exhibit antagonistic effects, where the coexistence of two groups (at least one of which is toxic) leads to reduced toxicity. For example, sulfonic acid combined with halogenated benzene or azo groups is illustrated in the blue circles of Fig. 4b (the blue spherical section represents molecules that include trifluoromethyl-benzimidazole, a highly toxic fragment). The coexistence of furan and nitro groups can decrease the toxicity of a molecule to non-toxic levels (Log(LD50) > 3). These observations highlight that the toxicological effects of fragments are not solely additive or subtractive. The toxicity of a molecule can be either potentiated or mitigated depending on its chemical environment. Thus, the ML models developed in this work may play an important role in discovering such non-linear relationships, facilitating the development of safer and more sustainable chemicals.
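The synergy and antagonism comparison described above can be illustrated with a minimal sketch. It is not the statistical procedure used in the paper: it simply compares the mean Log(LD50) of molecules that contain one fragment without a second fragment against molecules that contain both. The function name, the `tol` parameter, and the toy records are hypothetical and exist only for illustration.

```python
from statistics import mean


def cooccurrence_shift(records, frag_a, frag_b, tol=0.0):
    """Compare molecules containing frag_a without frag_b against
    molecules containing both fragments.

    records: iterable of (set_of_fragment_names, log_ld50) pairs.
    Returns (mean_without_b, mean_with_both, label).
    """
    without_b = [y for frags, y in records if frag_a in frags and frag_b not in frags]
    both = [y for frags, y in records if frag_a in frags and frag_b in frags]
    if not without_b or not both:
        raise ValueError("need at least one molecule in each group")
    m_without, m_both = mean(without_b), mean(both)
    # Lower Log(LD50) means higher toxicity, so a drop is read as synergy.
    if m_both < m_without - tol:
        label = "synergistic"
    elif m_both > m_without + tol:
        label = "antagonistic"
    else:
        label = "neutral"
    return m_without, m_both, label


# Toy, made-up records for illustration only (not data from the paper).
toy_records = [
    ({"sulfonic_acid"}, 3.6),
    ({"sulfonic_acid"}, 3.4),
    ({"sulfonic_acid", "thiophene"}, 2.1),
    ({"sulfonic_acid", "thiophene"}, 2.5),
    ({"sulfonic_acid", "azo"}, 3.9),
]

if __name__ == "__main__":
    print(cooccurrence_shift(toy_records, "sulfonic_acid", "thiophene"))
    print(cooccurrence_shift(toy_records, "sulfonic_acid", "azo"))
```

A plain comparison of means ignores confounding from other fragments in the same molecules, which is one reason a non-linear ML model is a more suitable tool for the full dataset.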
Similar molecule analysis using eFragD descriptors

Understanding toxicity variations among structurally similar compounds remains a key challenge in molecular risk assessment. To evaluate the sensitivity of the eFragD-RFR model in capturing subtle structure–toxicity relationships, small datasets of structural analogs were constructed. These included compounds sharing the same scaffold but differing in fragment composition, attachment positions, or localized electronic environments.

A simple example involves nitrobenzene-based molecules, where replacing a fragment changes the toxicity. Nitrobenzene is known as a toxicophore for its widespread mutagenicity, genotoxicity, and carcinogenicity, primarily attributed to the electrophilic nature of the nitro group [27]. However, not all molecules containing a nitrobenzene group are toxic. The features we defined were analyzed to assess their correlation with toxicity, and the five most relevant descriptors are illustrated in Fig. 5a. A strong relationship was observed between the atomic charge difference (Δq) and toxicity values (Log(LD50)). In Fig. 5b, the values of Δq for toxic and non-toxic compounds are highly clustered (red and blue dots), with a boundary at Δq = 0.6. For nitrobenzene derivatives, as shown in Fig. 5c, the charges of substituents (q_A) and the Δq of a molecule alone were sufficient to predict toxicity with high accuracy (RFR model). These findings are also consistent with the fact that electron-donating substituents heighten the electron density in the nitrobenzene ring, yielding an increased reactive electrophilicity. This, in turn, makes the molecule more susceptible to oxidation by the cytochrome P450 enzyme system (CYP450), producing electrophilic metabolites that contribute to toxicity, particularly hepatotoxicity.
Another sub-dataset was analyzed to visualize the toxicity variation among similar molecules with the same A groups (A = ester/amide group) but different fragment charges, as shown in Figs. 5d-5f. In Fig. 5d, the Sure Independence Screening and Sparsifying Operator (SISSO) method [34] was employed to construct new descriptors (D1 and D2) derived from the four toxicity-relevant eFragD features. Notably, SISSO descriptor D1 exhibited a significantly stronger positive correlation with molecular toxicity than the eFragD features (Fig. 5e), and the RFR model achieved perfect performance on this sub-dataset with the D1 and D2 descriptors (Fig. 5f). The expression of D1 and molecular analysis revealed that parameters such as M_HA, q_A, and q_min enhance the D1 values, while the branching carbon number (N_C) has an inverse effect. Notably, the charge values of fragment A (q_A) and q_min were negative. These findings indicate that increasingly negative values of q_A and q_min correlate with enhanced nucleophilic reactivity in molecules, which is associated with elevated acute toxicity. Moreover, the observed escalation in toxicity corresponding to increased N_C values can be attributed to augmented hydrophobicity, which amplifies bioaccumulation potential and facilitates interactions with biological targets. These results demonstrate that eFragD's atom- and fragment-level descriptors effectively capture subtle structural variations, enabling accurate and reliable toxicity prediction. This structural sensitivity reinforces eFragD's value in supporting informed, sustainable chemical design.

Cross-domain fragment projection of cyanocidal activity

Building on validated predictions of low rodent toxicity, the eFragD framework was extended to identify cyanocides with minimal ecological risk.
This cross-domain application leverages the similarity between toxicological mechanisms in animals and cyanobacteria, which often share reactive functional groups and metabolic pathways (Supporting Section 4, Tables S17 and S18). The eFragD approach was developed from extensive acute toxicity data with a focus on molecular fragments. This raises the question of whether eFragD descriptors can also effectively screen for cyanocides, which would further demonstrate its potential as a robust, interpretable tool for sustainable water quality management. Our analysis included 50 compounds with potential cyanocidal activity, comprising both our experimentally confirmed cyanocides (Supporting Sections 4.1-4.4, Table S19, Figures S11 and S12) and those reported in the literature (Table S20) [35-40]. The descriptors were extracted by the eFragD method, including the number of low rodent-toxicity fragments (N_Frag), the fragment density per unit area (N_Frag/Å²) and the number of aromatic rings (N_AR). Some physicochemical properties (e.g., LogP, α, and MPI) were also assessed against the inhibitory effect on cyanobacterial growth at 96 h (IE_96h, %), classified as significant inhibitory effect (SIE) or no observed effect (NOE) (Fig. 6) (see Supporting Section 4.5 for the specific analysis). Correlation analysis identified the six descriptors most strongly associated with cyanocidal activity (Fig. S13). These descriptors were combined into a composite feature, D1_Cyanos, by the SISSO method. Using a random forest (RF) binary classification model (classification criteria are detailed in Supporting Section 4.5), the D1_Cyanos descriptor was found to effectively distinguish SIE from NOE (Fig. 6a). Further analysis (Figs. 6b and S13) of distinct molecular properties revealed that the inhibitory effects of molecules tend to be inversely related to LogP and N_AR, but positively related to N_Frag/Å² and MPI.
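As a rough illustration of the correlation screening described above, the following sketch ranks descriptors by the absolute Pearson correlation with the inhibitory effect. The text does not state which correlation measure was used for the cyanocidal analysis, so Pearson's r is an assumption here, and the descriptor values are made up to mimic the reported signs (negative for LogP, positive for fragment density); they are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def rank_descriptors(descriptors, target):
    """Return (name, r) pairs sorted by |r|, strongest association first."""
    scores = {name: pearson_r(values, target) for name, values in descriptors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy values (assumption, for illustration only).
ie_96h = [80.0, 60.0, 40.0, 20.0]          # inhibitory effect, %
toy_descriptors = {
    "LogP": [0.5, 1.5, 2.5, 3.5],          # expected negative association
    "N_Frag_per_A2": [0.05, 0.04, 0.03, 0.01],  # expected positive association
}

if __name__ == "__main__":
    for name, r in rank_descriptors(toy_descriptors, ie_96h):
        print(f"{name}: r = {r:+.3f}")
```

With only 50 compounds, such correlations are best read as a screening aid for choosing SISSO inputs rather than as evidence of causal relationships.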
Hydrophilic compounds with stronger hydrogen-bonding capacity or multiple bioactive fragments, which engage in electrostatic interactions with biomolecules (such as enzymes and proteins), tend to exhibit more effective cyanocidal activity.

Molecule-protein interaction for effective fragment analysis

Molecular docking studies were conducted to analyze how molecular fragments and their interactions contribute to the binding efficiency with proteins, shedding light on the inhibition mechanism at the atomic level. Non-heme iron proteins are crucial for cyanobacterial growth through their roles in electron transport and redox reactions during photosynthesis; non-heme iron is a key component of many oxidases and dioxygenases, and these proteins have been shown to serve as strong binding sites for small molecules [41-43]. Therefore, molecular protein-binding affinities were assessed using AutoDock Tools 4.2 [44] (excluding molecules with LogP > 4; for details and analysis, see Supporting Section 4.6, Figures S14 and S15). The predicted binding energies (Fig. S14) revealed that most molecules exhibit strong binding affinity with proteins. Statistical analysis in Fig. 6c (binding energies corresponding to molecular fragments) highlighted that carboxyl, hydroxyl and ester groups exhibited strong binding affinities with target proteins. Significant interactions with residues, such as hydrogen bonds and π-π interactions, are observed in Fig. S15. Figure 6d presents a quadrant plot of the relationship between rat oral toxicity (Log(LD50)) and cyanobacterial inhibitory effect (IE_96h). Quadrant I contains 13 compounds that exhibit both low rodent toxicity (Log(LD50) > 3.301; LD50 values listed in Tables S19 and S20) and high cyanocidal activity (IE_96h > 20%; IE_96h values listed in Table S21). These compounds, such as acetylacetone (Log(LD50) = 3.456 and IE_96h = 83%), are ideal candidates for cyanocides.
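The Quadrant I criterion can be written as a short check. This sketch uses only the two cutoffs quoted above (Log(LD50) > 3.301 and IE_96h > 20%); the function and constant names are illustrative and not taken from the paper. Note that 3.301 is approximately log10(2000), so the toxicity cutoff corresponds to an LD50 of about 2000 mg/kg.

```python
import math

LOG_LD50_CUTOFF = 3.301  # from the text; roughly log10(2000 mg/kg)
IE_96H_CUTOFF = 20.0     # percent inhibition at 96 h, from the text


def log_ld50(ld50_mg_per_kg):
    """Convert an LD50 in mg/kg to Log10(LD50)."""
    if ld50_mg_per_kg <= 0:
        raise ValueError("LD50 must be positive")
    return math.log10(ld50_mg_per_kg)


def is_quadrant_one(log_ld50_value, ie_96h_percent):
    """True when a compound has both low rodent toxicity and high
    cyanocidal activity under the two cutoffs above."""
    return log_ld50_value > LOG_LD50_CUTOFF and ie_96h_percent > IE_96H_CUTOFF


if __name__ == "__main__":
    # Acetylacetone values as quoted in the text.
    print(is_quadrant_one(3.456, 83.0))
```

Both comparisons are strict, matching the ">" signs in the text, so a compound sitting exactly on a cutoff is not counted as a Quadrant I candidate.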
These findings highlight the utility of the eFragD-RF model in identifying dual-benefit molecules, which are both environmentally safer and biologically effective.

Conclusion

This study developed an atomic charge-enhanced molecular fragmentation method, eFragD, that integrates chemically meaningful substructures with electrostatic information derived from DFT calculations. By capturing subtle variations in fragment properties, eFragD improves the sensitivity and interpretability of molecular bioactivity predictions. Our method supports automatic extraction of user-defined low-toxicity or risk-associated fragments, enabling interpretable insight into structure–activity relationships. eFragD-based descriptors, when applied to classification and regression models, demonstrated predictive performance that matches or exceeds established computational tools such as T.E.S.T. Notably, eFragD facilitated the discovery of environmentally safer chemical candidates by identifying non-toxic motifs, with successful applications in predicting both rodent acute toxicity and cyanocidal activity. The framework further shows promise for sustainable material development, such as organic ligands for hybrid perovskites. By combining predictive accuracy with fragment-level interpretability, eFragD represents a scalable step toward embedding green chemistry principles into predictive toxicology and sustainable water management. Future extensions may incorporate 3D structural descriptors to enhance generalization across broader biological endpoints. eFragD descriptors could also be embedded within deep learning frameworks, allowing for end-to-end learning while retaining interpretable fragment-level contributions.

Methods

Acute toxicity dataset

We collected a dataset of rat oral acute toxicity (experimental LD50 values, mg/kg), consisting of 7,804 data samples. The sampled compounds are composed of ten elements (C, H, O, N, F, Cl, Br, I, P, and S).
The LD50 values span a broad range from 0.1 to 7×10^4 mg/kg and were converted to log10(LD50, mg/kg), labeled as Log(LD50). The SMILES strings were converted into 3D structures, which were refined by geometry optimizations and frequency calculations with DFT at the M06-2X/6-311G(d,p) level with an implicit water solvent model (IEFPCM), as implemented in Gaussian 31. The SDD basis set was used for the molecules containing iodine. Hirshfeld atomic charges were calculated with the Hirshfeld population method in Multiwfn 3.8 20. An additional 1,283 samples with experimental rat oral LD50 values were gathered from the T.E.S.T. database 6 as an external test set.

Dataset curation

To ensure data quality, a multi-step data preparation process was carried out:
(1) Duplicate records and molecules with conflicting toxicity values were removed.
(2) Inorganic and organometallic substances, salts, charged species, and mixtures were excluded.
(3) Tautomers and compounds with a molecular weight above 800 were excluded.
(4) If the experimental toxicity value of a molecule varied between sources, the compound was eliminated.
Finally, a toxicity dataset containing 7,804 chemicals was obtained.

eFragD construction

The construction process of the eFragD statistical model is illustrated in Extended Data Fig. 1. As depicted in Extended Data Fig. 1a, the framework consists of three parts: (i) the construction of a basic fragment-based dictionary, including fragments and atomic charges; (ii) the matching and extraction of molecular fragments; and (iii) the generation of molecular features for the downstream task. The initial step mainly serves to obtain labels for general fragment semantics (divided into ring and chain parts) with clear boundary rules, which can be adjusted for specific tasks and serve as the standardization of eFragD (Table S22).
Next, the relevant information is extracted from the input molecular structure files and matched against the compiled basic dictionary. Specifically, this task involves detecting atomic strings within the input molecule and collecting their nearest neighbors, including the number (N_connect) and type (T_connect) of connected atoms. Where ring structures are present in a molecule, the number of atoms in the ring (N_ring) is determined first. The collected data are encoded as input and then classified according to the basic dictionary. The encoding can be further categorized by considering different local environments around the fragments. Using the collected information, chemically meaningful fragments are identified and quantified in terms of their types, numbers (N_mn) and Hirshfeld charges (q_mn). The structural information is quantified into a matrix A (with dimensions M×N). To complement fragment-level information, several global molecular descriptors were included to capture whole-molecule properties relevant to toxicity. Specifically, descriptors (dimension 1×N) such as the fraction of saturated carbons (F_sp3), the total mass of heavy atoms (M_HA), the number of heteroatoms (N_HET), the maximum/minimum atomic charges (q_max/q_min), the charge difference (∆q), and the numbers of quaternary/tertiary/secondary/primary carbons (N_C/N_CH/N_CH2/N_CH3), which are also calculated via the eFragD method, serve as additional features for the machine learning model. Each compound was represented by a hybrid feature set combining chemically interpretable fragment-based features (N_mn × q_mn) and global structural attributes, forming a unified input vector.
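As an illustration of the hybrid feature vector described above, the following Python sketch builds one N_mn × q_mn value per fragment label and appends charge-based global descriptors. It rests on several assumptions that are not confirmed by the text: the fragment charge is taken as the sum of the atomic charges of its atoms, q_mn is taken as the mean charge over the occurrences of a fragment, and ∆q is taken as q_max − q_min. The function names, fragment labels and charge values are hypothetical and are not taken from the eFragD code package.

```python
from typing import Dict, List, Sequence, Tuple

# A fragment match: (fragment label, indices of the atoms it covers).
FragmentMatch = Tuple[str, Sequence[int]]


def fragment_features(
    matches: List[FragmentMatch],
    atom_charges: Sequence[float],
    fragment_labels: Sequence[str],
) -> List[float]:
    """Return one N_mn x q_mn value per fragment label (0.0 if the fragment is absent)."""
    features = []
    for label in fragment_labels:
        # Assumption: fragment charge = sum of the atomic charges of its atoms.
        charges = [
            sum(atom_charges[i] for i in atoms)
            for name, atoms in matches
            if name == label
        ]
        n = len(charges)
        q_mean = sum(charges) / n if n else 0.0  # assumption: q_mn is a mean
        features.append(n * q_mean)
    return features


def global_charge_features(atom_charges: Sequence[float]) -> Dict[str, float]:
    """Charge-based global descriptors; delta_q is assumed to be q_max - q_min."""
    q_max, q_min = max(atom_charges), min(atom_charges)
    return {"q_max": q_max, "q_min": q_min, "delta_q": q_max - q_min}


# Illustrative numbers only, not computed Hirshfeld charges.
charges = [0.10, -0.30, 0.20, -0.30, 0.05]
matches = [("hydroxyl", [1]), ("hydroxyl", [3]), ("carbonyl", [0, 2])]
labels = ["hydroxyl", "carbonyl", "amide"]

vector = fragment_features(matches, charges, labels)
vector += list(global_charge_features(charges).values())
print(vector)
```

Keeping a fixed label order means that every molecule maps to a vector of the same length, with zeros for absent fragments, which is what a random forest or similar model expects as input.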
This strategy enabled the model to simultaneously capture local electrophilic character (from DFT-derived fragment charges) and overall physicochemical context (from global features), enhancing the robustness and generalizability of toxicity prediction. Extended Data Fig. 1b presents two examples illustrating the identification rules for eFragD construction. In the upper section of Extended Data Fig. 1b, phosphate fragments are automatically discriminated by recognizing keywords derived from the eFragD basic dictionary. These rules primarily evaluate the N_connect values (where 1 and 2 denote oxygen atoms connected by single and double bonds, respectively) and T_connect (phosphate) of the oxygen strings, in combination with an N_connect of 4 and a T_connect of oxygen for the phosphate string. Note that, under this fragmentation, only an oxygen atom bearing a C/H atom (R4 boundary rules) on the other side of the phosphate group is considered valid. A priority scheme that favors more intricate and explicit fragments is crucial. As exemplified in the lower section of Extended Data Fig. 1b, some amide-like structures are first detected as simple amide fragments at the lower layers and then merged into higher-level entities (larger in number of atoms), which are given greater priority. The more complex fragments were also recorded as their generic counterparts, because generic fragments are often sufficient to explain molecular properties and occur more frequently, which makes them more likely to match test compounds. The same fragments (e.g., amide) could also be further categorized based on different surrounding chemical environments, including double bonds, cyclic structures, and benzene rings. In total, 155 eFragD fragments are collected in Extended Data Fig. 2, and the boundary rules are detailed in Supporting Section 5 and Table S22.
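The priority scheme described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the eFragD implementation: candidate matches are represented as sets of atom indices, larger fragments are resolved first, a smaller candidate is dropped only when all of its atoms are already covered by accepted fragments, and each accepted complex fragment is also counted under its generic label. The "urea" fragment, the atom indices and all names are hypothetical.

```python
from typing import Dict, FrozenSet, List, NamedTuple, Optional, Set


class Match(NamedTuple):
    name: str                      # fragment label
    atoms: FrozenSet[int]          # atom indices covered by the match
    generic: Optional[str] = None  # generic label also recorded, if any


def resolve_by_priority(candidates: List[Match]) -> List[Match]:
    """Accept larger fragments first; skip candidates fully covered by them."""
    accepted: List[Match] = []
    claimed: Set[int] = set()
    for cand in sorted(candidates, key=lambda m: len(m.atoms), reverse=True):
        if cand.atoms <= claimed:  # every atom already belongs to a larger fragment
            continue
        accepted.append(cand)
        claimed |= cand.atoms
    return accepted


def count_fragments(accepted: List[Match]) -> Dict[str, int]:
    """Count each accepted fragment and, where given, its generic counterpart."""
    counts: Dict[str, int] = {}
    for m in accepted:
        counts[m.name] = counts.get(m.name, 0) + 1
        if m.generic is not None:
            counts[m.generic] = counts.get(m.generic, 0) + 1
    return counts


# Hypothetical N-C(=O)-N motif (atoms 0-3) plus a separate hydroxyl (atoms 5-6).
candidates = [
    Match("amide", frozenset({0, 1, 2})),
    Match("amide", frozenset({1, 2, 3})),
    Match("urea", frozenset({0, 1, 2, 3}), generic="amide"),
    Match("hydroxyl", frozenset({5, 6})),
]

accepted = resolve_by_priority(candidates)
counts = count_fragments(accepted)
print([m.name for m in accepted], counts)
```

In this sketch the two overlapping amide matches are absorbed by the larger fragment, which is still counted once as a generic amide. How partially overlapping fragments should be handled is a design choice that the text does not specify.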
Declarations

Data availability

All original toxicology datasets used in our study are available from the ToxiVerse database 45, the PubChem website 46 and the T.E.S.T. database. The corresponding links are as follows: ToxiVerse, https://toxiverse.com/toxdata; PubChem, https://pubchem.ncbi.nlm.nih.gov/; T.E.S.T., https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test.

Code availability

The procedure and a skeleton Python script for feature analysis using the eFragD method are described in Supporting Section 6 and Fig. S16. A simplified and executable Python script, along with a quick_demo for running a sample fragmentation workflow, is included in the eFragD_submission_code_package to demonstrate the key feature extraction. The full source code used for model training and evaluation is available from the corresponding author upon reasonable request.

Acknowledgements

Financial support for this work was provided by the National Key Research and Development Program of China (2023YFB3813001), the National Natural Science Foundation of China (grant nos. 22033004, 22373049, 22336002, 22425603), the Natural Science Foundation of Jiangsu Province (BK20232012), and the Jiangsu Funding Program for Excellent Postdoctoral Talent (grant no. 2023ZB655).

Author information

These authors contributed equally: Chen-Chen Zhao, Shaoyi Hou.

Authors and affiliations

State Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, Jiangsu 210023, P. R. China
Chen-Chen Zhao, Shaoyi Hou, Cheng Fu, Guoqiang Wang and Jing Ma

State Key Laboratory of Pollution Control and Resource Reuse, School of Environment, Nanjing University, Nanjing, Jiangsu 210023, P. R.
China
Peng Peng, Shujuan Zhang

State Key Laboratory of Chemo/Biosensing and Chemometrics, Advanced Catalytic Engineering Research Center of the Ministry of Education, College of Chemistry and Chemical Engineering, Hunan University, Changsha, Hunan 410082, P. R. China
Jian-Jun Feng

Author contributions

C. Zhao designed the study, curated the data, and wrote the manuscript. S. Hou developed the code for the eFragD method. C. Fu performed statistical analyses. P. Peng conducted the cyanobacterial inhibition experiments. G. Wang and J. Feng provided the new molecules for experiments. S. Zhang and J. Ma acquired funding, supervised the research, and contributed to writing, review and editing. All co-authors provided feedback on the manuscript and figures.

References

1. Johnson, A. C. et al. Learning from the past and considering the future of chemicals in the environment. Science 367, 384-387 (2020).
2. Zhang, Q. et al. Cyanobacterial blooms contribute to the diversity of antibiotic-resistance genes in aquatic ecosystems. Commun. Biol. 3, 737 (2020).
3. Feng, L. et al. Harmful algal blooms in inland waters. Nat. Rev. Earth Environ. 5, 631-644 (2024).
4. Zhang, Y. et al. Chemical contaminants in blood and their implications in chronic diseases. J. Hazard. Mater. 466, 133511 (2024).
5. Zou, H. et al. Continuing large-scale global trade and illegal trade of highly hazardous chemicals. Nat. Sustain. 6, 1394-1405 (2023).
6. U. S. Environmental Protection Agency. User’s guide for T.E.S.T. (version 5.1) (toxicity estimation software tool): a program to estimate toxicity from molecular structure. https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test (2020).
7. U. S. Environmental Protection Agency. Ecological structure-activity relationships program (ECOSAR) operation manual v2.2. https://www.epa.gov/tsca-screening-tools/ecological-structure-activity-relationships-ecosar-predictive-model (2022).
8. Bai, C. et al. Machine Learning Enabled Drug Induced Toxicity Prediction. Adv. Sci. 12, 2413405 (2025).
9. Ketkar, R. et al. A Benchmark Study of Graph Models for Molecular Acute Toxicity Prediction. Int. J. Mol. Sci. 24, 11966 (2023).
10. Gilmer, J. et al. Neural Message Passing for Quantum Chemistry. In Proc. 34th International Conference on Machine Learning arXiv:1704.01212 (2017).
11. Xiong, Z. et al. Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism. J. Med. Chem. 63, 8749-8760 (2019).
12. Ying, R. et al. GNNExplainer: Generating Explanations for Graph Neural Networks. In Proc. 33rd International Conference on Neural Information Processing Systems arXiv:1903.03894 (2019).
13. Patlewicz, G. et al. An evaluation of the implementation of the cramer classification scheme in the Toxtree software. SAR QSAR Environ. Res. 19, 495-524 (2008).
14. Cortes-Ciriano, I. Bioalerts: a python library for the derivation of structural alerts from bioactivity and toxicity data sets. J. Cheminf. 8, 13-18 (2016).
15. Durant, J. L. et al. Reoptimization of MDL keys for use in drug discovery. J. Chem. Inf. Comput. Sci. 42, 1273-1280 (2002).
16. Lewell, X. Q. et al. RECAP--retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. J. Chem. Inf. Comput. Sci. 38, 511-522 (1998).
17. Degen, J. et al. On the art of compiling and using 'drug-like' chemical fragment spaces. ChemMedChem 3, 1503-1507 (2008).
18. Vangala, S. R. et al. pBRICS: a novel fragmentation method for explainable property prediction of drug-like small molecules. J. Chem. Inf. Model. 63, 5066-5076 (2023).
19. Liu, Z. et al. Machine learning on properties of multiscale multisource hydroxyapatite nanoparticles datasets with different morphologies and sizes. npj Comput. Mater. 7, 142 (2021).
20. Lu, T. et al. Multiwfn: a multifunctional wavefunction analyzer. J. Comput. Chem. 33, 580-592 (2012).
21. Liu, S. et al. Information conservation principle determines electrophilicity, nucleophilicity, and regioselectivity. J. Phys. Chem. A 118, 3698-3704 (2014).
22. Sushko, I. et al. ToxAlerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. J. Chem. Inf. Model. 52, 2310-2316 (2012).
23. Chen, J. et al. Automated machine learning of interfacial interaction descriptors and energies in metal-catalyzed N2 and CO2 reduction reactions. Langmuir 41, 3490-3502 (2025).
24. Landrum, G. RDKit: Open-source cheminformatics software. https://www.rdkit.org/ (2006).
25. Rogers, D. et al. Extended connectivity fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010).
26. Lundberg, S. et al. A unified approach to interpreting model predictions. In Proc. 31st International Conference on Neural Information Processing Systems 4768-4777 (2017).
27. Li, X. et al. In silico prediction of chemical acute oral toxicity using multi-classification methods. J. Chem. Inf. Model. 54, 1061-1069 (2014).
28. Di Stefano, M. et al. VenomPred 2.0: a novel in silico platform for an extended and human interpretable toxicological profiling of small molecules. J. Chem. Inf. Model. 64, 2275-2289 (2023).
29. Gadaleta, D. et al. SAR and QSAR modeling of a large collection of LD50 rat acute oral toxicity data. J. Cheminf. 11, 58-73 (2019).
30. Machhar, J. et al. Computational prediction of toxicity of small organic molecules: state-of-the-art. Phys. Sci. Rev. 4, 20190009 (2019).
31. Frisch, M. J. et al. Gaussian 16 Rev. C.01. (Wallingford, CT, 2016).
32. Jia, Q. et al. Fast prediction of lipophilicity of organofluorine molecules: deep learning-derived polarity characters and experimental tests. J. Chem. Inf. Model. 62, 4928-4936 (2022).
33. Wu, Y. et al. Universal machine learning aided synthesis approach of two-dimensional perovskites in a typical laboratory. Nat. Commun. 15, 138 (2024).
34. Ouyang, R. et al. SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. Phys. Rev. Mater. 2, 083802 (2018).
35. Rastogi, R. P. et al. Bloom dynamics of cyanobacteria and their toxins: environmental health impacts and mitigation strategies. Front. Microbiol. 6, 1254 (2015).
36. Batterton, J. et al. Anilines selective toxicity to blue-green algae. Science 199, 1068-1070 (1978).
37. Jančula, D. et al. Critical review of actually available chemical compounds for prevention and management of cyanobacterial blooms. Chemosphere 85, 1415-1422 (2011).
38. Nakai, S. et al. Myriophyllum spicatum-released allelopathic polyphenols inhibiting growth of blue-green algae Microcystis aeruginosa. Water Res. 34, 3026-3032 (2000).
39. Wei, P. et al. Efficient inhibition of cyanobacteria M. aeruginosa growth using commercial food-grade fumaric acid. Chemosphere 301, 134659-134666 (2022).
40. Yilimulati, M. et al. Regulation of photosynthesis in bloom-forming cyanobacteria with the simplest β-diketone. Environ. Sci. Technol. 55, 14173-14184 (2021).
41. González, A. et al. Pivotal role of iron in the regulation of cyanobacterial electron transport. Adv. Microb. Physiol. 68, 169-217 (2016).
42. Gao, H. et al. The diversity and applications of microbial iron metabolism and iron-containing proteins. Commun. Biol. 8, 177-179 (2025).
43. Schalk, I. J. Bacterial siderophores: diversity, uptake pathways and applications. Nat. Rev. Microbiol. 23, 24-40 (2024).
44. Morris, G. M. et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30, 2785-2791 (2009).
45. Russo, D. P. et al. Nonanimal models for acute toxicity evaluations: applying data-driven profiling and read-across. Environ. Health Perspect. 127, 47001 (2019).
46. Kim, S. et al. PubChem 2025 update. Nucleic Acids Res. 53, 1516-1525 (2025).

Additional Declarations

The authors declare no competing interests.
Supplementary Files

supportinginformation20250516.docx
ExtendedDataFigures.docx

{"props":{"pageProps":{"initialData":{"identity":"rs-7617003","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":515083534,"identity":"94f291b1-4e37-4733-aff4-871a79d2f69a","order_by":0,"name":"Chen-Chen Zhao","email":"","orcid":"","institution":"Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Chen-Chen","middleName":"","lastName":"Zhao","suffix":""},{"id":515084035,"identity":"0d9d09d3-9c95-4b66-8273-19b5435891e2","order_by":1,"name":"Shaoyi Hou","email":"","orcid":"","institution":"Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Shaoyi","middleName":"","lastName":"Hou","suffix":""},{"id":515084627,"identity":"4a115f73-5c6f-4eee-9a71-8325aba279d4","order_by":2,"name":"Cheng Fu","email":"","orcid":"","institution":"Nanjing 
University","correspondingAuthor":false,"prefix":"","firstName":"Cheng","middleName":"","lastName":"Fu","suffix":""},{"id":515084628,"identity":"fd5ea4c5-ebc6-48a0-913b-186fa74d931b","order_by":3,"name":"Peng Peng","email":"","orcid":"","institution":"Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Peng","middleName":"","lastName":"Peng","suffix":""},{"id":515084629,"identity":"85c923c0-43d2-4ba5-aa4f-6c1d54ea1325","order_by":4,"name":"Guoqiang Wang","email":"","orcid":"","institution":"Nanjing University","correspondingAuthor":false,"prefix":"","firstName":"Guoqiang","middleName":"","lastName":"Wang","suffix":""},{"id":515084630,"identity":"5ca3042a-5700-4bdc-9794-b83d71c64538","order_by":5,"name":"Jian-Jun Feng","email":"","orcid":"","institution":"Hunan University","correspondingAuthor":false,"prefix":"","firstName":"Jian-Jun","middleName":"","lastName":"Feng","suffix":""},{"id":515084631,"identity":"1a7d3b80-f7aa-43a0-bc2b-19394d50539b","order_by":6,"name":"Shujuan Zhang","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA10lEQVRIiWNgGAWjYFCCA2wgUg7IAFJsJGgxJkULRFliA4JNAMg3Hn/24OeO2vTtjGcMGD6UHWbgn92AX4vBgQPphr1njufubDhjwDjj3GEGiTsHCGhhOHBMgrftWO6GA2cMmHnbDjMYSCQQcFjDwTbJv23H0g1AWv4So4XhwGE2ad62mgSwFkZitBgcOMYmLdt2wHDDgWMFB3vOpfNI3CDksBnHn0m+bauTN7hxeOODH2XWcvwzCDlM4gCIPAxmgJg8BNQDAX8DiKyDMUbBKBgFo2AUYAIAScBMCqV+VHMAAAAASUVORK5CYII=","orcid":"","institution":"Nanjing University","correspondingAuthor":true,"prefix":"","firstName":"Shujuan","middleName":"","lastName":"Zhang","suffix":""},{"id":515084632,"identity":"0fd202ca-c7da-4dea-8f25-1d3a7ffc256f","order_by":7,"name":"Jing 
Ma","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAyklEQVRIiWNgGAWjYHCChAMMDDYMjA1AJg8JWtJI0wIChyEUUVp02w88PFzw63we84wExgdv2xjkzQlpMTuTkHB4Zt/tYsYZCcyGc9sYDHc2ENJyAKiFt+d2YuOMBDZp3jaGBIMDhLScfwDScg6khf03cVpuAG3h+XEAbAszkVpAtjQkJzb2PGyWnHNOwnADYYflJH/m+WOXuLE9+eCHN2U28gRtAcZFAgNjGwODYQM4MiUIqgcCdqCpfxgY5IlROwpGwSgYBSMTAABWE0iNlZZ/fwAAAABJRU5ErkJggg==","orcid":"","institution":"Nanjing University","correspondingAuthor":true,"prefix":"","firstName":"Jing","middleName":"","lastName":"Ma","suffix":""}],"badges":[],"createdAt":"2025-09-15 06:50:40","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-7617003/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7617003/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":91423426,"identity":"d9a0dc48-31ec-4d6d-85b9-0cfa79b01779","added_by":"auto","created_at":"2025-09-16 10:38:55","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":867237,"visible":true,"origin":"","legend":"\u003cp\u003eAtomic charge enhanced molecular representation of eFragD framework. (a) Comparison between common fragmentation strategies with eFragD. (b) Enhanced differentiation of subtle fragment features enabled by electrostatic information in eFragD. 
(c) Versatile applications of eFragD in eco-safety and environmental risk assessment.\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/b44b1b2c06ae3af209ad3939.png"},{"id":91423845,"identity":"4c7f22e0-5cd3-4c4e-b8ae-2de284e980aa","added_by":"auto","created_at":"2025-09-16 10:46:55","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":3441749,"visible":true,"origin":"","legend":"\u003cp\u003eFlowchart of eFragD-based machine learning method. (a) Collection of acute toxicity datasets and construction of eFragD. (b) and (c) Toxicity and cyanobacterial inhibitory activity prediction with eFragD descriptors, respectively. Therein, cyanobacterial growth inhibition was divided into significant inhibitory effect (SIE) or no observed effect (NOE). MPI, molecular polarity index.\u003c/p\u003e","description":"","filename":"floatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/c86c71dd11740f4e0eee3657.png"},{"id":91423433,"identity":"585610cf-e257-4b3b-8167-b2ec67bc56d6","added_by":"auto","created_at":"2025-09-16 10:38:56","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":1180487,"visible":true,"origin":"","legend":"\u003cp\u003eAcute toxicity performance using the eFragD-RF model. (a) Top-10 negative (non-toxic) or positive (toxic) fragments contributing to molecular toxicity. (b) Performance curve of the multiclassification model (eFragD-RFC). (c) Performance of regression model (eFragD-RFR). (d) Contributions of eFragD descriptors in the eFragD-RFR model. 
(e) Molecular examples showing toxicity changes due to fragment modifications.\u003c/p\u003e","description":"","filename":"floatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/5b00724c7a6dea83fd8f1a70.png"},{"id":91423443,"identity":"3b060402-7f55-44b4-9bc5-ab5912a46363","added_by":"auto","created_at":"2025-09-16 10:38:58","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":563379,"visible":true,"origin":"","legend":"\u003cp\u003eIllustration of fragment interactions and their impact on molecular toxicity. (a) Schematic of synergistic and antagonistic fragment interactions in toxicity prediction. (b) Toxicity distributions of molecules containing sulfonic acid with various substituents. (c) Fragments that increase toxicity when coexisting with sulfonic acid. (d) Synergistic toxicity effect of benzimidazole with sulfonic acid.\u003c/p\u003e","description":"","filename":"floatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/9454738d58acbf06f46d39e5.png"},{"id":91423440,"identity":"30931271-1de2-4ec2-827a-11b89e59f68e","added_by":"auto","created_at":"2025-09-16 10:38:57","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":734316,"visible":true,"origin":"","legend":"\u003cp\u003eDiscrimination and toxicity prediction of analogs in the dataset (nitrobenzene derivatives in a-c and carboxyl/amide substituted benzene derivatives in d-f). (a) Five most relevant eFragD descriptors and their correlation with molecular toxicity in the nitrobenzene derivatives. (b) Clustering visualization of danger (Log(LD\u003csub\u003e50\u003c/sub\u003e)≤2.48) and warning-safety (Log(LD\u003csub\u003e50\u003c/sub\u003e)>2.48) compounds based on ∆\u003cem\u003eq\u003c/em\u003e and \u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e. 
(c) Experimental versus predicted toxicity values of nitrobenzene derivatives based on charge-related descriptors using eFragD-RFR model. (d) Correlation analysis based on SISSO-derived descriptors (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and \u003cem\u003eD\u003c/em\u003e\u003csub\u003e2\u003c/sub\u003e), and molecular features. (e) Joint distribution between \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and experimental toxicity values of carboxyl/amide-containing analogs. (f) Performance of the eFragD-RFR model in predicting toxicity of carboxyl/amide substituted benzene derivatives with \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and \u003cem\u003eD\u003c/em\u003e\u003csub\u003e2\u003c/sub\u003e features.\u003c/p\u003e","description":"","filename":"floatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/5aca7a5fbd212ca7bcd19ec4.png"},{"id":91423442,"identity":"e39193c1-9df5-429a-b8bd-969349a22f58","added_by":"auto","created_at":"2025-09-16 10:38:58","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":1199709,"visible":true,"origin":"","legend":"\u003cp\u003eScreening of molecules for cyanocidal activity and molecular-protein interaction analysis. (a) Distribution of the six most important descriptors for cyanocidal activity. (b) Performance of the RF binary classification model with the SISSO feature (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e\u003csup\u003eCyanos\u003c/sup\u003e). (c) Molecular-protein interaction analysis by molecular docking (Protein data bank ID: 2AXT). 
(d) Quadrant analysis of rodent toxicity and cyanocidal activity.\u003c/p\u003e","description":"","filename":"floatimage6.png","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/5b31073815abcaf3aa02672c.png"},{"id":91423854,"identity":"c91a9050-4030-43fb-bcd5-592f7da7387c","added_by":"auto","created_at":"2025-09-16 10:47:03","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":8984401,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/ef6e6dd3-f9df-4754-9aa8-85250c83ec5a.pdf"},{"id":91423432,"identity":"093c9606-31c3-4132-be88-6a8e98070ccd","added_by":"auto","created_at":"2025-09-16 10:38:56","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":13945113,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cbr\u003e\u003c/p\u003e","description":"","filename":"supportinginformation20250516.docx","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/9e4ea7266cc875b44000eee7.docx"},{"id":91423414,"identity":"f6ef79bc-bcdb-48aa-a7f8-b7b82834ffad","added_by":"auto","created_at":"2025-09-16 10:38:53","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":579638,"visible":true,"origin":"","legend":"","description":"","filename":"ExtendedDataFigures.docx","url":"https://assets-eu.researchsquare.com/files/rs-7617003/v1/388aa982a173f37a57a75c8e.docx"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003eA transferable fragment dictionary for eco-safety and environmental risk assessment\u003c/p\u003e","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe global pursuit of environmental sustainability requires not only the mitigation of existing pollutants but also the proactive development of safe and eco-compatible chemicals\u003csup\u003e1\u003c/sup\u003e. 
Aquatic ecosystems, which are critical to ensuring freshwater availability and sustaining biodiversity, exhibit heightened vulnerability to chemical disturbances originating from industrial and pharmaceutical sources, particularly through the proliferation of harmful algal blooms (HABs)\u003csup\u003e\u0026nbsp;\u003c/sup\u003e\u003csup\u003e2\u003c/sup\u003e\u003csup\u003e,\u003c/sup\u003e\u003csup\u003e3\u003c/sup\u003e.\u003csup\u003e\u0026nbsp;\u003c/sup\u003eWhile some chemicals mitigate HABs, many exert unintended toxic effects on non-target aquatic species, including ecologically important cyanobacteria. With tens of thousands of novel compounds being introduced into the environment annually, there is an increasing need to evaluate their safety for both human health and the long-term resilience of ecosystems, as well as the sustainability of water systems\u003csup\u003e4\u003c/sup\u003e\u003csup\u003e\u0026nbsp;\u003c/sup\u003e\u003csup\u003e5\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eVarious risk assessment tools, such as the Toxicity Estimation Software Tool (T.E.S.T.)\u003csup\u003e6\u003c/sup\u003e\u003csup\u003e\u0026nbsp;\u003c/sup\u003eand the Ecological Structure Activity Relationships (ECOSAR)\u003csup\u003e7\u003c/sup\u003e,\u003csup\u003e\u0026nbsp;\u003c/sup\u003erely on structural analogs or extensive toxicity data. Due to interspecies variability, these approaches often struggle to generalize across organisms. Artificial intelligence (AI)-based methods have gained increasing attention in environmental toxicity modeling\u003csup\u003e8\u003c/sup\u003e\u003csup\u003e-11\u003c/sup\u003e.\u003csup\u003e\u0026nbsp;\u003c/sup\u003eAmong them, message-passing neural networks (MPNN)\u003csup\u003e10\u003c/sup\u003e and AttentiveFP\u003csup\u003e11\u003c/sup\u003e have demonstrated strong predictive capabilities across various molecular endpoints. 
However, the limited availability of high-quality datasets like cyanobacteria making them less suitable for proactive, sustainability-oriented chemical screening.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eInterpretability is another challenge. Understanding which structural features drive biological effects is essential for designing of safer, more sustainable chemicals. Tools like GNNExplainer\u003csup\u003e12\u003c/sup\u003e\u003csup\u003e\u0026nbsp;\u003c/sup\u003eor directly employing AttentiveFP enable visualization of atom-level contributions. However, their outputs often depend on model training and initialization, and may not correspond to chemically meaningful substructures. For molecular prediction models, the most intuitive data to be used is the chemical substructure information. Conventional fragmentation approaches, including predefined substructure patterns methods (e.g., structural alerts in ToxAlerts\u003csup\u003e13\u003c/sup\u003e and Bioalerts\u003csup\u003e14\u003c/sup\u003e and Molecular ACCess System (MACCS)\u003csup\u003e15\u003c/sup\u003e), as well as retrosynthesis-based schemes such as Combinatorial Analysis Procedure (RECAP)\u003csup\u003e16\u003c/sup\u003e and Breaking of Retro-synthetically Interesting Chemical Fragments (BRICS)\u003csup\u003e17,18\u003c/sup\u003e, offer interpretable representations. However, these methods often fail to capture subtle electronic or contextual differences, making it difficult to distinguish between structurally similar molecules with vastly different biological effects\u0026mdash;a phenomenon known as the \u0026ldquo;activity cliff.\u0026rdquo;\u003c/p\u003e\n\u003cp\u003eHere, we developed a molecular representation method, called atomic charge enhanced Fragment Dictionary (eFragD). This approach encodes chemically meaningful substructures and integrates atomic charge derived from DFT calculations (Fig. 1). 
By introducing electrostatic information, eFragD improves both the sensitivity of predictions to subtle structural changes and the interpretability of model outputs. We demonstrate the utility of eFragD in predicting molecular bioactivity for two sustainability-relevant endpoints: rodent acute toxicity and cyanocidal activity. By identifying and leveraging low-toxicity fragments, eFragD enables the discovery of environmentally safer compounds, including potential cyanocides and organic ligands for hybrid perovskite materials. This interpretable and scalable framework provides a step toward integrating green chemistry principles into predictive toxicology and sustainable water management.\u003c/p\u003e"},{"header":"Result and Discussion ","content":"\u003cp\u003eEnhancing fragment–electrostatic framework for sustainable chemical design\u003c/p\u003e\u003cp\u003eThe eFragD approach links chemically meaningful substructures with electronic information to predict both toxicity and environmental safety (see detailed construction process of eFragD in \u003cb\u003eMethods\u003c/b\u003e). The eFragD collection has 155 molecular fragments, encompassing commonly used functional groups, molecular fragments, patterns formed by multiple fragments (Fig.\u0026nbsp;2a). The eFragD introduces a hierarchical and chemically intuitive fragmentation strategy by: (1) defining fragments based on functional relevance and chemical intuition; (2) preserving key structural contexts (e.g., differentiating phenolic –OH groups from aliphatic hydroxyls groups); and (3) integrating fragment-level quantum descriptors such as Hirshfeld charges to reflect local electrophilicity and reactivity. This systematic integration retains the interpretability of classical fragment-based models while overcoming their lack of electronic specificity and functional coherence. 
By jointly encoding structural topology and electrostatic environment, eFragD bridges the gap between qualitative interpretability and mechanistic precision. It enables fine-grained fragment-level attribution, enhances toxicity prediction accuracy, and facilitates the proactive identification of ecologically safer chemical candidates.\u003c/p\u003e\u003cp\u003eTo guide fragment selection with improved interpretability, a self-attention mechanism was introduced into a multilevel attention graph convolutional neural network (GCNN) model\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e to identify substructures that are most relevant to toxicity prediction (see Supporting Section 1, Figures S1-S2 and Table S1). This attention-based analysis provided a data-driven strategy to highlight chemically meaningful regions (Table S2), which were then incorporated into the construction of the eFragD dictionary. For an input structure (e.g., CAS: 64249-01-0 as an example illustrated in Fig.\u0026nbsp;2b), the program detected its fragments including halogen, amide and phosphate-like groups in eFragD. Notably, halogen and amide groups in eFragD were identified as being attached to the benzene ring. Charge properties were incorporated into the identified fragments. Atomic Hirshfeld charges, calculated by Multiwfn\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e, are reported to be positively correlated with electrophilicity\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e. The summation of atomic charges gave the total charge of the fragment, which is taken as an indicator of local electrophilicity in a compound. 
The output table listed the product of the count and the charge value of each fragment, forming a set of features (\u003cem\u003eN\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e×\u003cem\u003eq\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e) for molecular toxicity prediction and fragment analysis. The toxic fragments are labeled as structural alerts (SAs, red), while the non-toxic fragments are labeled in blue. To verify the robustness of the model, 18 fragments with low acute toxicity were selected for cyanocidal activity prediction, yielding effective and environmentally relevant candidates. This enables ecologically targeted strategies that minimize collateral toxicity, supporting both water quality improvement and primary productivity conservation (Fig.\u0026nbsp;2c).\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eeFragD descriptors in toxicity modeling\u003c/b\u003e\u003c/p\u003e\u003cp\u003eLeveraging abundant rat oral acute toxicity data, eFragD descriptors, labeled as basic functional groups and potential toxicophores\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e, were used to train a model that balances effectiveness and interpretability for toxicity prediction. After preprocessing (Supporting Section 2, Tables S3-S8 and Fig. S3), the 42 most frequently occurring fragments and ten global metrics (\u003cem\u003eF\u003c/em\u003e\u003csub\u003esp3\u003c/sub\u003e, \u003cem\u003eM\u003c/em\u003e\u003csub\u003eHA\u003c/sub\u003e, \u003cem\u003eN\u003c/em\u003e\u003csub\u003eHET\u003c/sub\u003e, ∆\u003cem\u003eq\u003c/em\u003e, etc.), assigned as Feature 1 to Feature 52 (listed in Table S9), were ultimately selected as inputs for our ML models.\u003c/p\u003e\u003cp\u003eThe 42 most frequent fragment features were further analyzed to assess the impact of fragments on toxicity. 
The top 10 representative fragments associated with non-toxic and toxic compounds, respectively, are visualized in Fig.\u0026nbsp;3a. All fragments were categorized as \"Toxic\" (including Danger in Category I and Warning in Category II) or \"Safe\" (Category III), with details provided in Tables S10-S12. In Fig.\u0026nbsp;3a, the most toxic fragments are phosphate-like groups and oxime functional groups, followed by P=O\u003csub\u003ea\u003c/sub\u003e double bonds. In contrast, chemicals containing sulfonic acid, aldehyde, carboxyl, or hydroxyl groups (the latter unattached to tertiary carbons, called NoTert-OH) have relatively higher LD\u003csub\u003e50\u003c/sub\u003e values, indicating lower toxicity. Since the acute toxicity of a molecule ultimately depends on the whole molecular structure, which contains multiple fragments, it is necessary to combine fragments with ML methods for toxicity assessment.\u003c/p\u003e\u003cp\u003eIn the initial stage of molecular toxicity assessment, a three-category (I-III) classification was carried out. The Random Forest (RF) classification algorithm (eFragD-RFC) and a weighted-average method were employed to predict toxicity multi-labels across the entire dataset. The Receiver Operating Characteristic (ROC) curves and multiclass confusion matrix are presented in Figs.\u0026nbsp;3b and S4, respectively. As indicated in Fig.\u0026nbsp;3b, the eFragD-RFC model achieved a macro-average area under the curve (AUC) of 0.82, indicating good performance in multiclass toxicity prediction. The accuracy of the eFragD-RFC model was 88.0% for the training set and 73.4% for the test set (Fig. 
S4), demonstrating that the oral toxicity of most compounds was labeled correctly.\u003c/p\u003e\u003cp\u003eTo further quantitatively predict Log(LD\u003csub\u003e50\u003c/sub\u003e) values for rat oral toxicity, 17 individual regression models were generated using various ML algorithms\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e and eFragD descriptors. The parameters and performances of these models are presented in Supporting Section 3, Table S13 and Fig. S5. As highlighted in Fig.\u0026nbsp;3c, the RF regression algorithm (eFragD-RFR) outperformed the others, with predicted values closely matching experimental data. The prediction accuracy is well within acceptable limits for biological experiments. Other molecular descriptors, e.g., RDKit descriptors\u003csup\u003e\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e\u003c/sup\u003e, MACCS and ECFP6\u003csup\u003e25\u003c/sup\u003e, were calculated for comparison with the eFragD descriptors in the RFR model, as shown in Fig. S6. These results indicate that the eFragD descriptors yield outcomes comparable to those obtained using RDKit and MACCS fingerprints (with ECFP6 performing relatively less effectively), despite utilizing significantly fewer descriptors.\u003c/p\u003e\u003cp\u003eSHapley Additive exPlanations (SHAP) analysis\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e based on the eFragD-RFR model was used to calculate the attributions of features. The top 10 most influential features are visualized in Fig.\u0026nbsp;3d, with mean contribution values for the entire dataset ranked on the left. A positive contribution indicates that the descriptor increases the predicted Log(LD\u003csub\u003e50\u003c/sub\u003e) value, while a negative contribution indicates that it decreases the predicted value. 
The phosphorus-containing groups (P=O\u003csub\u003ea\u003c/sub\u003e double bonds and phosphate-like groups), nitrogen-related groups (-NH, amino and oxime groups) and halogen atoms, which appear as dispersed, high-density SHAP values (red dots) among the negative contributions, lower the model’s predicted Log(LD\u003csub\u003e50\u003c/sub\u003e) values. Notably, higher atomic charge difference (∆\u003cem\u003eq\u003c/em\u003e) values, reflecting enhanced molecular electrophilicity and reactivity, correspond to lower Log(LD\u003csub\u003e50\u003c/sub\u003e) values and thus higher toxicity. Similarly, the total mass of heavy atoms (\u003cem\u003eM\u003c/em\u003e\u003csub\u003eHA\u003c/sub\u003e) in input molecules shows a negative correlation with molecular acute toxicity values. In contrast, the carboxyl group and the fraction of saturated carbons (\u003cem\u003eF\u003c/em\u003e\u003csub\u003esp3\u003c/sub\u003e) enhance hydrophilicity and thus favor lower toxicity. The correlations between rat oral acute toxicity and various features, as well as the importance of features assessed by the \u003cem\u003efeature_importances_\u003c/em\u003e attribute and the Pearson correlation coefficient (\u003cem\u003er\u003c/em\u003e), exhibit minimal differences (Fig. S7). 
These fragment-level attributions reproduce known toxicological knowledge while offering interpretable mechanistic insights.\u003c/p\u003e\u003cp\u003eInterestingly, phosphorus fragments were highly enriched in Category I compounds (top three entries of Table S10), raising the question: Why are the phosphate groups, commonly found in biological molecules like DNA, RNA and ATP, considered highly toxic?\u003c/p\u003e\u003cp\u003ePhosphate-like groups, particularly organophosphates, are known as effective components in insecticides that disrupt the nervous system by inhibiting acetylcholinesterase\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e,\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e\u003c/sup\u003e. The toxicity is not due to the phosphate group itself in its natural form, but rather to its specific chemical environment and structure. According to frequency analysis (Tables S14 and S15), isolated phosphate groups are typically classified as non-toxic. However, when oxygen atoms in phosphate are replaced by sulfur atoms to form thiophosphates, the reactivity increases significantly as shown in Fig.\u0026nbsp;3e. In particular, the sulfur-phosphorus bond contributes the most to compound toxicity, making it a key factor in toxicity prediction, consistent with literature on the toxicity of organosulfur-phosphorus compounds\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e,\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e\u003c/sup\u003e. Notably, eFragD successfully differentiates between benign phosphate-containing structures and their hazardous analogs. 
This capacity enables precise toxicity screening and supports the identification of safer, eco-compatible alternatives in molecular design, advancing sustainability-aligned chemical selection.\u003c/p\u003e\u003cp\u003eSubstitution analyses confirmed the toxicological influence of specific fragments, as depicted in Fig.\u0026nbsp;3e. From SHAP analysis, chlorine (Cl) and amino/amide groups were identified as influencing toxicity, with chlorine associated with increased toxicity (red) and amino/amide groups linked to reduced toxicity (blue). Replacing a Cl atom or hydrogen atom with an amino or amide group decreases the molecular toxicity. Subsequently, several physicochemical descriptors were gathered to provide supplementary insights into the properties influencing molecular toxicity. These include the number of hydrogen bond donors (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eH-D\u003c/sub\u003e) and acceptors (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eH-A\u003c/sub\u003e) (from the RDKit package), molecular polarity index (MPI) (calculated with Multiwfn software), molecular polarizability (\u003cem\u003eα\u003c/em\u003e) (using Gaussian 16), and Log\u003cem\u003eP\u003c/em\u003e (from the PoLogP model)\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e,\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e\u003c/sup\u003e. For comparison, these global descriptors were added to the eFragD descriptors to build an RF regression model. As shown in Fig. S8, the inclusion of descriptors like Log\u003cem\u003eP\u003c/em\u003e, \u003cem\u003eα\u003c/em\u003e, and MPI did not significantly improve the prediction performance of the RFR model. 
Our fragment-based approach achieves prediction results comparable to those of conventional descriptors.\u003c/p\u003e\u003cp\u003eExternal validation was performed using 1,283 chemicals that were not included in the training dataset and had experimental rat oral LD\u003csub\u003e50\u003c/sub\u003e values. The results demonstrated that the prediction accuracy remained robust and showed improved performance compared to the T.E.S.T. software (specific results shown in Fig. S9). Beyond predictive accuracy, eFragD enabled interpretable identification of high-risk fragments, supporting safer-by-design chemical innovation.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eTo extend eFragD toward sustainable materials development, we applied the framework to predict the acute toxicity of 122 organic ligands proposed for hybrid perovskite materials (not included in our training data) from recent literature\u003csup\u003e\u003cspan citationid=\"CR33\" class=\"CitationRef\"\u003e33\u003c/span\u003e\u003c/sup\u003e. Using eFragD descriptors and the pre-trained eFragD-RFR model, we predicted the acute oral toxicity (LD\u003csub\u003e50\u003c/sub\u003e value) of these ligands (see Fig. S10). Most candidates exhibited moderate predicted toxicity, while a few were labeled as potentially low- or high-toxicity, highlighting the importance of systematic screening in material design. Furthermore, the fragment-level interpretability of eFragD facilitated the identification of toxicity-associated structural features. For instance, polarized \u003cem\u003eα\u003c/em\u003e,\u003cem\u003eβ\u003c/em\u003e-unsaturated fragments, such as thiophene, furan, and pyrrole (e.g., CAS 53916-75-9, 118488-08-7, 29709-35-1), were flagged for potential reactivity, reflecting known Michael acceptor behavior. 
These results demonstrate that eFragD can serve as an interpretable toxicity filter for large chemical libraries, supporting environmentally conscious materials development.\u003c/p\u003e\u003cp\u003e\u003cb\u003eToxicological effects of coexisting fragments\u003c/b\u003e\u003c/p\u003e\u003cp\u003eAs chemical structures become more complex, fragment-based analysis offers an effective strategy to reveal underlying patterns and extract interpretable chemical knowledge related to molecular toxicity. However, since biological effects arise from the full molecular context, the impact of individual fragments may shift depending on their neighboring groups and overall structure. This raises the question of whether fragment coexistence results in synergistic effects, like \"adding fuel to the fire\" (increasing toxicity), or antagonistic effects, similar to \"extinguishing the fire\" (reducing toxicity) (Fig.\u0026nbsp;4a). To explore this, the interaction between fragments and the whole molecular structure is examined through case studies (Figs.\u0026nbsp;4b-4d). The statistical results are shown in color coding to distinguish positive (blue), negative (red), and neutral (gray) associations.\u003c/p\u003e\u003cp\u003eWhen fragments exhibit synergistic effects, their combination can lead to greater toxicity than a single structural alert (SA). For instance, while sulfonic acid groups and their derivatives are non-toxic on their own, their toxicity rises when combined with other toxic groups such as thiophene, benzo five-membered diazepine-containing heterocyclic (B5MDH) groups and phosphate-like groups, etc., as indicated by the different red dots in Figs.\u0026nbsp;4b and 4c, where the effect of the B5MDH groups is particularly pronounced. Notably, in our dataset of 144 molecules containing the benzimidazole group (specific fragment of B5MDH groups) (Fig.\u0026nbsp;4d), 127 (approximately 88%) fall into toxicity category I. 
Further investigation revealed a clear synergistic effect when benzimidazole coexists with a trifluoromethyl group, namely 2-(trifluoromethyl)benzimidazole (2TFMBI), which has an LD\u003csub\u003e50\u003c/sub\u003e of 28 mg/kg, placing it in the Danger category. Among the 128 molecules containing this group in the dataset, the logarithmic toxicity values range from −0.70 to 2.38, all within category I. Certain fragments, like anhydrides (non-toxic on their own) and pyrazoles, become toxic or exhibit increased toxicity when attached to a benzene ring (Table S16). The attachment increased the negative charge of the fragment (\u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e), thereby enhancing the reactivity of electrophiles. This may lead to covalent binding to electron-rich sites of endogenous nucleophilic proteins (such as receptors and enzymes) and nucleic acids (including DNA and RNA), potentially causing adverse outcomes.\u003c/p\u003e\u003cp\u003eIn contrast, some fragment pairs exhibit antagonistic effects, where the coexistence of two groups (at least one of which is toxic) leads to reduced toxicity. For example, sulfonic acid combined with halogenated benzene or azo groups is illustrated in the blue circles of Fig.\u0026nbsp;4b (the blue spherical section represents the molecules that include trifluoromethyl-benzimidazole, a highly toxic fragment). The coexistence of furan and nitro groups can decrease the toxicity of a molecule to non-toxic levels (Log(LD\u003csub\u003e50\u003c/sub\u003e) \u0026gt; 3). These observations highlight that the toxicological effects of fragments are not solely additive or subtractive. The toxicity of a molecule can either be potentiated or mitigated depending on its chemical environment. 
Thus, the ML models developed in this work may play an important role in discovering such non-linear relationships, facilitating the development of safer and more sustainable chemicals.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eSimilar molecule analysis using eFragD descriptors\u003c/b\u003e\u003c/p\u003e\u003cp\u003eUnderstanding toxicity variations among structurally similar compounds remains a key challenge in molecular risk assessment. To evaluate the sensitivity of the eFragD-RFR model in capturing subtle structure–toxicity relationships, small datasets of structural analogs were constructed. These included compounds sharing the same scaffold but differing in fragment composition, attachment positions, or localized electronic environments.\u003c/p\u003e\u003cp\u003eA simple example involves nitrobenzene-based molecules, where the replacement of a fragment causes changes in toxicity. Nitrobenzene is known as a toxicophore for its widespread mutagenicity, genotoxicity, and carcinogenicity, primarily attributed to the electrophilic nature of the nitro group.\u003csup\u003e[27]\u003c/sup\u003e However, not all molecules containing a nitrobenzene group are toxic. The features we defined were analyzed to assess their correlation with toxicity, and the five most relevant descriptors are illustrated in Fig.\u0026nbsp;5a. A strong relationship was observed between the atomic charge difference (∆\u003cem\u003eq\u003c/em\u003e) and toxicity values (Log(LD\u003csub\u003e50\u003c/sub\u003e)). In Fig.\u0026nbsp;5b, the values of ∆\u003cem\u003eq\u003c/em\u003e for toxic and non-toxic compounds are highly clustered (red and blue dots), with a boundary at ∆\u003cem\u003eq\u003c/em\u003e = 0.6. 
For nitrobenzene derivatives, as shown in Fig.\u0026nbsp;5c, the charges of substituents (\u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e) and ∆\u003cem\u003eq\u003c/em\u003e of a molecule alone were sufficient to predict toxicity with high accuracy (RFR model). These findings were also consistent with the fact that electron-donating substituents heighten the electron density in the nitrobenzene ring, yielding an increased reactive electrophilicity. This, in turn, makes the molecule more susceptible to oxidation by the cytochrome P450 enzyme system (CYP450), producing electrophilic metabolites that contribute to toxicity, particularly hepatotoxicity.\u003c/p\u003e\u003cp\u003eAnother sub-dataset was analyzed to visualize the toxicity variation among similar molecules with the same A groups (A = ester/amide group) but different fragment charges, as shown in Figs.\u0026nbsp;5d-5f. In Fig.\u0026nbsp;5d, the Sure Independence Screening and Sparsifying Operator (SISSO) method\u003csup\u003e\u003cspan citationid=\"CR34\" class=\"CitationRef\"\u003e34\u003c/span\u003e\u003c/sup\u003e was employed to construct new descriptors (\u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and \u003cem\u003eD\u003c/em\u003e\u003csub\u003e2\u003c/sub\u003e) derived from the four toxicity-relevant eFragD features. Notably, SISSO descriptor \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e exhibited a significantly stronger positive correlation with molecular toxicity compared to eFragD features (Fig.\u0026nbsp;5e). And the RFR model achieved a perfect performance with the \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and \u003cem\u003eD\u003c/em\u003e\u003csub\u003e2\u003c/sub\u003e descriptors (Fig.\u0026nbsp;5f). 
The expression of \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e and molecular analysis revealed that parameters such as \u003cem\u003eM\u003c/em\u003e\u003csub\u003eHA\u003c/sub\u003e, \u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e, and \u003cem\u003eq\u003c/em\u003e\u003csub\u003emin\u003c/sub\u003e enhance the \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e values, while the branching carbon number (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eC\u003c/sub\u003e) has an inverse effect. Notably, the charge values of fragment A (\u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e) and \u003cem\u003eq\u003c/em\u003e\u003csub\u003emin\u003c/sub\u003e were negative. These findings indicate that increasingly negative values of \u003cem\u003eq\u003c/em\u003e\u003csub\u003eA\u003c/sub\u003e and \u003cem\u003eq\u003c/em\u003e\u003csub\u003emin\u003c/sub\u003e correlate with enhanced nucleophilic reactivity in molecules, which is associated with elevated acute toxicity. Moreover, the observed escalation in toxicity corresponding to increased \u003cem\u003eN\u003c/em\u003e\u003csub\u003eC\u003c/sub\u003e values can be attributed to augmented hydrophobicity, which amplifies bioaccumulation potential and facilitates interactions with biological targets.\u003c/p\u003e\u003cp\u003eThese results demonstrate that eFragD’s atom- and fragment-level descriptors effectively capture subtle structural variations, enabling accurate and reliable toxicity prediction. This structural sensitivity reinforces eFragD’s value in supporting informed, sustainable chemical design.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003e\u003cb\u003eCross-domain fragment projection of cyanocidal activity\u003c/b\u003e\u003c/p\u003e\u003cp\u003eBuilding on validated predictions of low rodent toxicity, the eFragD framework was extended to identify cyanocides with minimal ecological risk. 
This cross-domain application leverages the structural similarity between toxicological mechanisms in animals and cyanobacteria, which often share reactive functional groups and metabolic pathways (Supporting Section 4, Tables S17 and S18). Based on extensive acute toxicity data, the eFragD approach has been developed to focus on molecular fragments. This raises the question of whether eFragD descriptors can effectively screen for cyanocides, which would further demonstrate their potential as a robust, interpretable tool for sustainable water quality management.\u003c/p\u003e\u003cp\u003eOur analysis included 50 compounds with potential cyanocidal activity, comprising both our experimentally confirmed cyanocides (Supporting Sections 4.1–4.4, Table S19, Figures S11 and S12) and those reported in the literature (Table S20)\u003csup\u003e\u003cspan additionalcitationids=\"CR36 CR37 CR38 CR39\" citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e–\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. The descriptors were extracted by the eFragD method, including the number of low rodent-toxicity fragments (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eFrag\u003c/sub\u003e), fragment density per unit area (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eFrag\u003c/sub\u003e/Å\u003csup\u003e2\u003c/sup\u003e) and the number of aromatic rings (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eAR\u003c/sub\u003e). 
Some physicochemical properties (e.g., LogP, \u003cem\u003eα\u003c/em\u003e, and MPI) were also assessed for their relationship to the inhibitory effect at 96 h (IE\u003csub\u003e96 h\u003c/sub\u003e, %) on cyanobacterial growth, classified as significant inhibitory effect (SIE) or no observed effect (NOE) (Fig.\u0026nbsp;6; see Supporting Section 4.5 for the detailed analysis).\u003c/p\u003e\u003cp\u003eCorrelation analysis identified six descriptors that are most strongly associated with cyanocidal activity (Fig. S13). 
These descriptors were combined into a composite feature, \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e\u003csup\u003eCyanos\u003c/sup\u003e, by the SISSO method. Using a random forest (RF) binary classification model (classification criteria are detailed in Supporting Section 4.5), the \u003cem\u003eD\u003c/em\u003e\u003csub\u003e1\u003c/sub\u003e\u003csup\u003eCyanos\u003c/sup\u003e descriptor was found to effectively distinguish SIE from NOE (Fig.\u0026nbsp;6a). Further analysis (Figs.\u0026nbsp;6b and S13) of distinct molecular properties revealed that the inhibitory effects of molecules tend to exhibit an inverse relationship with LogP and \u003cem\u003eN\u003c/em\u003e\u003csub\u003eAR\u003c/sub\u003e, but a positive relationship with \u003cem\u003eN\u003c/em\u003e\u003csub\u003eFrag\u003c/sub\u003e/Å\u003csup\u003e2\u003c/sup\u003e and MPI. Hydrophilic compounds with stronger hydrogen-bonding capacity or multiple bioactive fragments, which engage in electrostatic interactions with biomolecules (such as enzymes and proteins), tend to exhibit more effective cyanocidal activity.\u003c/p\u003e\u003cp\u003e\u003cb\u003eMolecule-protein interaction for effective fragment analysis\u003c/b\u003e\u003c/p\u003e\u003cp\u003eMolecular docking studies were conducted to analyze the contribution of molecular fragments and their interactions to the binding efficiency with proteins, shedding light on the inhibition mechanism at an atomic level. 
Non-heme iron proteins, which include many oxidases and dioxygenases, are crucial for cyanobacterial growth through their roles in electron transport and redox reactions during photosynthesis, and have been shown to serve as strong binding sites for small molecules\u003csup\u003e\u003cspan additionalcitationids=\"CR42\" citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e–\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e. 
Therefore, molecular protein-binding affinities were assessed using AutoDock Tools 4.2\u003csup\u003e44\u003c/sup\u003e (excluding molecules with LogP \u0026gt; 4; for details and analysis, see Supporting Section 4.6, Figures S14 and S15).\u003c/p\u003e\u003cp\u003eBinding energy predictions (Fig. S14) revealed that most molecules exhibit strong binding affinity with proteins. Statistical analysis in Fig.\u0026nbsp;6c (binding energies corresponding to molecular fragments) highlighted that carboxyl, hydroxyl and ester groups exhibited strong binding affinities with target proteins. Significant interactions with residues, such as hydrogen bonds and π-π interactions, are shown in Fig. S15. Figure\u0026nbsp;6d presents a quadrant plot that highlights the relationship between rat oral toxicity (Log(LD\u003csub\u003e50\u003c/sub\u003e)) and cyanobacterial inhibitory effect (IE\u003csub\u003e96 h\u003c/sub\u003e). Quadrant I contains 13 compounds that exhibit both low rodent toxicity (Log(LD\u003csub\u003e50\u003c/sub\u003e) \u0026gt; 3.301, LD\u003csub\u003e50\u003c/sub\u003e values listed in Tables S19 and S20) and high cyanocidal activity (IE\u003csub\u003e96 h\u003c/sub\u003e \u0026gt; 20%, IE\u003csub\u003e96 h\u003c/sub\u003e values listed in Table S21). These compounds, such as acetylacetone (Log(LD\u003csub\u003e50\u003c/sub\u003e) = 3.456 and IE\u003csub\u003e96 h\u003c/sub\u003e = 83%), are ideal candidates for cyanocides. These findings highlight the utility of the eFragD-RF model in identifying dual-benefit molecules, which are both environmentally safer and biologically effective.\u003c/p\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis study developed an atomic charge-enhanced molecular fragmentation method, eFragD, that integrates chemically meaningful substructures with electrostatic information derived from DFT calculations. 
By capturing subtle variations in fragment properties, eFragD improves the sensitivity and interpretability of molecular bioactivity predictions. Our method supports automatic extraction of user-defined low-toxicity or risk-associated fragments, enabling interpretable insight into structure–activity relationships. eFragD-based descriptors, when applied to classification and regression models, demonstrated predictive performance that matches or exceeds that of established computational tools such as T.E.S.T. Notably, eFragD facilitated the discovery of environmentally safer chemical candidates by identifying non-toxic motifs, with successful applications in predicting both rodent acute toxicity and cyanocidal activity. The framework further shows promise for sustainable material development, such as organic ligands for hybrid perovskites. By combining predictive accuracy with fragment-level interpretability, eFragD represents a scalable step toward embedding green chemistry principles into predictive toxicology and sustainable water management. Future extensions may incorporate 3D structural descriptors to enhance generalization across broader biological endpoints. eFragD descriptors could also be embedded within deep learning frameworks, allowing for end-to-end learning while retaining interpretable fragment-level contributions.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e\u003cb\u003eAcute toxicity dataset\u003c/b\u003e\u003c/p\u003e\u003cp\u003eWe collected a dataset of rat oral acute toxicity (experimental LD\u003csub\u003e50\u003c/sub\u003e values, mg/kg), consisting of 7,804 data samples. The sampled compounds are composed of ten elements (C, H, O, N, F, Cl, Br, I, P, and S). 
The LD\u003csub\u003e50\u003c/sub\u003e values span a broad range from 0.1 to 7×10\u003csup\u003e4\u003c/sup\u003e mg/kg and were converted to Log\u003csub\u003e10\u003c/sub\u003e(LD\u003csub\u003e50\u003c/sub\u003e, mg/kg), labeled as Log(LD\u003csub\u003e50\u003c/sub\u003e). 
SMILES strings were converted into 3D structures through geometry optimization and frequency calculations using DFT at the M06-2X/6-311G(d,p) level with a water solvent model (IEFPCM), as implemented in Gaussian software\u003csup\u003e\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e\u003c/sup\u003e. The SDD basis set was used for the molecules containing iodine. Hirshfeld atomic charges were calculated by the Hirshfeld population method using Multiwfn 3.8 software\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e. An additional 1,283 samples with experimental rat oral LD\u003csub\u003e50\u003c/sub\u003e values were gathered from the T.E.S.T. database\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e as an external test set.\u003c/p\u003e\u003cp\u003e\u003cb\u003eDataset curation\u003c/b\u003e\u003c/p\u003e\u003cp\u003eTo ensure data quality, a multi-step data preparation process was carried out: (1) duplicate records and molecules with conflicting toxicity values were removed; (2) inorganic and organometallic substances, salts, charged species, and mixtures were excluded; (3) tautomers and compounds with a molecular weight above 800 were excluded; (4) any molecule whose experimental toxicity value varied between sources was eliminated. Finally, a toxicity dataset containing 7,804 chemicals was obtained.\u003c/p\u003e\u003cp\u003e\u003cb\u003eeFragD construction\u003c/b\u003e\u003c/p\u003e\u003cp\u003eThe construction process of the eFragD statistical model is illustrated in Extended Data Fig.\u0026nbsp;1. 
As depicted in Extended Data Fig.\u0026nbsp;1a, the framework consists of three parts: the construction of a basic fragment-based dictionary, including fragments and atomic charges; the matching and extraction of molecular fragments; and the generation of molecular features for the downstream task. The rationale of the initial step is mainly to obtain labels of general fragment semantics (divided into ring and chain parts) with clear boundary rules, which can be adjusted for specific tasks and serve as the standardization of eFragD (Table S22). Next, the related information is extracted from the input molecular structure files and matched against the compiled basic dictionary. Specifically, this task involves the detection of atomic strings within the input molecule and the collection of their nearest neighbors, including the number (\u003cem\u003eN\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e) and type (\u003cem\u003eT\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e) of connected atoms. When ring structures are present in a molecule, the number of atoms comprising the ring (\u003cem\u003eN\u003c/em\u003e\u003csub\u003ering\u003c/sub\u003e) is determined first. The collected data are encoded as input, which is then classified according to the basic dictionary. The encoding can be further categorized by considering different local environments around the fragments. Using the collected information, chemically meaningful fragments are identified and quantified, including their types, numbers (\u003cem\u003eN\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e) and Hirshfeld charges (\u003cem\u003eq\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e). The structural information is quantified into a matrix A (with dimensions M×N). To complement fragment-level information, several global molecular descriptors were included to capture whole-molecule properties relevant to toxicity. 
Specifically, descriptors (dimension 1×N) such as the fraction of saturated carbons (\u003cem\u003eF\u003c/em\u003e\u003csub\u003esp3\u003c/sub\u003e), the total mass of heavy atoms (\u003cem\u003eM\u003c/em\u003e\u003csub\u003eHA\u003c/sub\u003e), the number of heteroatoms (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eHET\u003c/sub\u003e), the maximum/minimum atomic charges (\u003cem\u003eq\u003c/em\u003e\u003csub\u003emax\u003c/sub\u003e/\u003cem\u003eq\u003c/em\u003e\u003csub\u003emin\u003c/sub\u003e), the charge difference (∆\u003cem\u003eq\u003c/em\u003e), and the numbers of quaternary/tertiary/secondary/primary carbons (\u003cem\u003eN\u003c/em\u003e\u003csub\u003eC\u003c/sub\u003e/\u003cem\u003eN\u003c/em\u003e\u003csub\u003eCH\u003c/sub\u003e/\u003cem\u003eN\u003c/em\u003e\u003csub\u003eCH2\u003c/sub\u003e/\u003cem\u003eN\u003c/em\u003e\u003csub\u003eCH3\u003c/sub\u003e), also calculated via the eFragD method, serve as additional features for the machine learning model. Each compound was represented by a hybrid feature set combining chemically interpretable fragment-based features (\u003cem\u003eN\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e × \u003cem\u003eq\u003c/em\u003e\u003csub\u003emn\u003c/sub\u003e) and global structural attributes, forming a unified input vector. This strategy enabled the model to simultaneously capture local electrophilic character (from DFT-derived fragment charges) and overall physicochemical context (from global features), enhancing the robustness and generalizability of toxicity prediction.\u003c/p\u003e\u003cp\u003eExtended Data Fig.\u0026nbsp;1b presents two examples illustrating the identification rules for eFragD construction. In the upper section of Extended Data Fig.\u0026nbsp;1b, phosphate fragments are automatically identified by recognizing keywords derived from the eFragD basic dictionary. 
These rules primarily involve the evaluation of \u003cem\u003eN\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e values (where 1 and 2 denote oxygen atoms connected by single and double bonds, respectively) and \u003cem\u003eT\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e (phosphate) of oxygen strings, in combination with \u003cem\u003eN\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e of 4 and \u003cem\u003eT\u003c/em\u003e\u003csub\u003econnect\u003c/sub\u003e of oxygen in the phosphate string. Note that, under this fragmentation, only an oxygen atom bearing a C/H atom (R\u003csub\u003e4\u003c/sub\u003e boundary rules) on the other side of the phosphate group is considered valid. A priority scheme that favors more intricate and explicit fragments is crucial. As exemplified in the lower section of Extended Data Fig.\u0026nbsp;1b, some amide-like structures are first detected as simple amide fragments at the lower layers and then amalgamated into higher-level entities (larger in number of atoms), which are given greater priority. The more complex fragments were also recorded as the generic ones, as these generic fragments are often sufficient to explain molecular properties and more frequently match test compounds. The same fragment (e.g., amide) can also be further categorized based on its surrounding chemical environment, including double bonds, cyclic structures, and benzene rings. In total, 155 eFragD fragments are collected in Extended Data Fig.\u0026nbsp;2, and the boundary rules are detailed in Supporting Section 5 and Table S22.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll original toxicology datasets used in our study are available from the ToxiVerse database\u003csup\u003e45\u003c/sup\u003e, the PubChem website\u003csup\u003e46\u0026nbsp;\u003c/sup\u003eand the T.E.S.T. 
database. The corresponding links are as follows: ToxiVerse, https://toxiverse.com/toxdata; PubChem, https://pubchem.ncbi.nlm.nih.gov/; T.E.S.T., https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe procedure and a skeleton Python script for feature analysis using the eFragD method are described in Supporting Section 6 and Fig. S16. A simplified and executable Python script, along with a quick_demo for running a sample fragmentation workflow, is included in the eFragD_submission_code_package to demonstrate the key feature extraction. The full source code used for model training and evaluation is available from the corresponding author upon reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFinancial support for this work was provided by the National Key Research and Development Program of China (2023YFB3813001), the National Natural Science Foundation of China (grant nos. 22033004, 22373049, 22336002, 22425603), the Natural Science Foundation of Jiangsu Province (BK20232012), and the Jiangsu Funding Program for Excellent Postdoctoral Talent (grant no. 2023ZB655).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor information\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThese authors contributed equally: Chen-Chen Zhao, Shaoyi Hou.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthors and affiliations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eState Key Laboratory of Coordination Chemistry, Key Laboratory of Mesoscopic Chemistry of Ministry of Education, School of Chemistry and Chemical Engineering, Nanjing University, Nanjing, Jiangsu 210023, P. R. 
China\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eChen-Chen Zhao, Shaoyi Hou, Cheng Fu, Guoqiang Wang and Jing Ma\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eState Key Laboratory of Pollution Control and Resource Reuse, School of Environment, Nanjing University, Nanjing, Jiangsu 210023, P. R. China\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003ePeng Peng, Shujuan Zhang\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eState Key Laboratory of Chemo/Biosensing and Chemometrics, Advanced Catalytic Engineering Research Center of the Ministry of Education, College of Chemistry and Chemical Engineering, Hunan University, Changsha, Hunan 410082, P. R. China\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eJian-Jun Feng\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eC. Zhao designed the study, curated the data, and wrote the manuscript. S. Hou developed the code for the eFragD method. C. Fu performed statistical analyses. P. Peng conducted the cyanobacterial inhibition experiments. G. Wang and J. Feng provided the new molecules for experiments. S. Zhang and J. Ma acquired funding, supervised the research, and contributed to writing, review and editing. All co-authors provided feedback on the manuscript and figures.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eJohnson, A. C. et al. Learning from the past and considering the future of chemicals in the environment. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e367\u003c/strong\u003e, 384-387 (2020).\u003c/li\u003e\n\u003cli\u003eZhang, Q. et al. Cyanobacterial blooms contribute to the diversity of antibiotic-resistance genes in aquatic ecosystems. \u003cem\u003eCommun. Biol.\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 737 (2020).\u003c/li\u003e\n\u003cli\u003eFeng, L. et al. Harmful algal blooms in inland waters. \u003cem\u003eNat. Rev. Earth Environ. 
\u003c/em\u003e\u003cstrong\u003e5\u003c/strong\u003e, 631-644 (2024).\u003c/li\u003e\n\u003cli\u003eZhang, Y. et al. Chemical contaminants in blood and their implications in chronic diseases. \u003cem\u003eJ. Hazard. Mater.\u003c/em\u003e \u003cstrong\u003e466\u003c/strong\u003e, 133511 (2024).\u003c/li\u003e\n\u003cli\u003eZou, H. et al. Continuing large-scale global trade and illegal trade of highly hazardous chemicals. \u003cem\u003eNat. Sustain.\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, 1394-1405 (2023).\u003c/li\u003e\n\u003cli\u003eU. S. Environmental Protection Agency. User\u0026rsquo;s guide for T.E.S.T. (version 5.1) (toxicity estimation software tool): a program to estimate toxicity from molecular structure. https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test (2020).\u003c/li\u003e\n\u003cli\u003eU. S. Environmental Protection Agency. Ecological structure-activity relationships program (ECOSAR) operation manual v2.2. https://www.epa.gov/tsca-screening-tools/ecological-structure-activity-relationships-ecosar-predictive-model (2022).\u003c/li\u003e\n\u003cli\u003eBai, C.\u003cem\u003e \u003c/em\u003eet al. Machine Learning Enabled Drug Induced Toxicity Prediction. \u003cem\u003eAdv. Sci.\u003c/em\u003e \u003cstrong\u003e12\u003c/strong\u003e, 2413405 (2025).\u003c/li\u003e\n\u003cli\u003eKetkar, R.\u003cem\u003e \u003c/em\u003eet al. A Benchmark Study of Graph Models for Molecular Acute Toxicity Prediction. \u003cem\u003eInt. J. Mol. Sci.\u003c/em\u003e \u003cstrong\u003e24\u003c/strong\u003e, 11966 (2023).\u003c/li\u003e\n\u003cli\u003eGilmer, J. et al. Neural Message Passing for Quantum Chemistry. In \u003cem\u003eProc. 34th International Conference on Machine Learning\u003c/em\u003e arXiv:1704.01212 (2017).\u003c/li\u003e\n\u003cli\u003eXiong, Z.\u003cem\u003e \u003c/em\u003eet al. Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism. \u003cem\u003eJ. Med. 
Chem.\u003c/em\u003e \u003cstrong\u003e63\u003c/strong\u003e, 8749-8760 (2019).\u003c/li\u003e\n\u003cli\u003eYing, R. et al. GNNExplainer: Generating Explanations for Graph Neural Networks. In \u003cem\u003eProc. 33rd International Conference on Neural Information Processing Systems\u003c/em\u003e arXiv:1903.03894 (2019).\u003c/li\u003e\n\u003cli\u003ePatlewicz, G. et al. An evaluation of the implementation of the cramer classification scheme in the Toxtree software. \u003cem\u003eSAR QSAR Environ. Res.\u003c/em\u003e \u003cstrong\u003e19\u003c/strong\u003e, 495-524 (2008).\u003c/li\u003e\n\u003cli\u003eCortes-Ciriano, I. Bioalerts: a python library for the derivation of structural alerts from bioactivity and toxicity data sets. \u003cem\u003eJ. Cheminf.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e, 13-18 (2016).\u003c/li\u003e\n\u003cli\u003eDurant, J. L. et al. Reoptimization of MDL keys for use in drug discovery. \u003cem\u003eJ. Chem. Inf. Comput. Sci.\u003c/em\u003e \u003cstrong\u003e42\u003c/strong\u003e, 1273-1280 (2002).\u003c/li\u003e\n\u003cli\u003eLewell, X. Q. et al. RECAP--retrosynthetic combinatorial analysis procedure: a powerful new technique for identifying privileged molecular fragments with useful applications in combinatorial chemistry. \u003cem\u003eJ. Chem. Inf. Comput.\u003c/em\u003e\u003cem\u003e Sci.\u003c/em\u003e \u003cstrong\u003e38\u003c/strong\u003e, 511-522 (1998).\u003c/li\u003e\n\u003cli\u003eDegen, J. et al. On the art of compiling and using \u0026apos;drug-like\u0026apos; chemical fragment spaces. \u003cem\u003eChemMedChem\u003c/em\u003e \u003cstrong\u003e3\u003c/strong\u003e, 1503-1507 (2008).\u003c/li\u003e\n\u003cli\u003eVangala, S. R. et al. pBRICS: a novel fragmentation method for explainable property prediction of drug-like small molecules. \u003cem\u003eJ. Chem. Inf. 
Model.\u003c/em\u003e\u003cem\u003e \u003c/em\u003e\u003cstrong\u003e63\u003c/strong\u003e, 5066-5076 (2023).\u003c/li\u003e\n\u003cli\u003eLiu, Z.\u003cem\u003e \u003c/em\u003eet al. Machine learning on properties of multiscale multisource hydroxyapatite nanoparticles datasets with different morphologies and sizes. \u003cem\u003enpj Comput. Mater.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 142 (2021).\u003c/li\u003e\n\u003cli\u003eLu, T. et al. Multiwfn: a multifunctional wavefunction analyzer. \u003cem\u003eJ. Comput. Chem.\u003c/em\u003e \u003cstrong\u003e33\u003c/strong\u003e, 580-592 (2012).\u003c/li\u003e\n\u003cli\u003eLiu, S. et al. Information conservation principle determines electrophilicity, nucleophilicity, and regioselectivity. \u003cem\u003eJ. Phys. Chem. A\u003c/em\u003e \u003cstrong\u003e118\u003c/strong\u003e, 3698-3704 (2014).\u003c/li\u003e\n\u003cli\u003eSushko, I. et al. ToxAlerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e52\u003c/strong\u003e, 2310-2316 (2012).\u003c/li\u003e\n\u003cli\u003eChen, J. et al. Automated machine learning of interfacial interaction descriptors and energies in metal-catalyzed N\u003csub\u003e2\u003c/sub\u003e and CO\u003csub\u003e2\u003c/sub\u003e reduction reactions. \u003cem\u003eLangmuir\u003c/em\u003e \u003cstrong\u003e41\u003c/strong\u003e, 3490-3502 (2025).\u003c/li\u003e\n\u003cli\u003eLandrum, G. RDKit: Open-source cheminformatics software. https://www.rdkit.org/ (2006).\u003c/li\u003e\n\u003cli\u003eRogers, D. et al. Extended connectivity fingerprints. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e50\u003c/strong\u003e, 742-754 (2010).\u003c/li\u003e\n\u003cli\u003eLundberg S. et al. A unified approach to interpreting model predictions. In \u003cem\u003eProc. 
31st International Conference on Neural Information Processing Systems\u003c/em\u003e 4768-4777 (2017).\u003c/li\u003e\n\u003cli\u003eLi, X. et al. In silico prediction of chemical acute oral toxicity using multi-classification methods. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e54\u003c/strong\u003e, 1061-1069 (2014).\u003c/li\u003e\n\u003cli\u003eDi Stefano, M. et al. VenomPred 2.0: a novel in silico platform for an extended and human interpretable toxicological profiling of small molecules. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e64\u003c/strong\u003e, 2275-2289 (2023).\u003c/li\u003e\n\u003cli\u003eGadaleta, D. et al. SAR and QSAR modeling of a large collection of LD\u003csub\u003e50\u003c/sub\u003e rat acute oral toxicity data. \u003cem\u003eJ. Cheminf.\u003c/em\u003e \u003cstrong\u003e11\u003c/strong\u003e, 58-73 (2019).\u003c/li\u003e\n\u003cli\u003eMachhar, J. et al. Computational prediction of toxicity of small organic molecules: state-of-the-art. \u003cem\u003ePhys. Sci. Rev.\u003c/em\u003e \u003cstrong\u003e4\u003c/strong\u003e, 20190009 (2019).\u003c/li\u003e\n\u003cli\u003eFrisch, M. J. et al. Gaussian 16 Rev. C.01. (Wallingford, CT, 2016).\u003c/li\u003e\n\u003cli\u003eJia, Q. et al. Fast prediction of lipophilicity of organofluorine molecules: deep learning-derived polarity characters and experimental tests. \u003cem\u003eJ. Chem. Inf. Model.\u003c/em\u003e \u003cstrong\u003e62\u003c/strong\u003e, 4928-4936 (2022).\u003c/li\u003e\n\u003cli\u003eWu, Y. et al. Universal machine learning aided synthesis approach of two-dimensional perovskites in a typical laboratory. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e15\u003c/strong\u003e, 138 (2024).\u003c/li\u003e\n\u003cli\u003eOuyang, R. et al. 
SISSO: a compressed-sensing method for identifying the best low-dimensional descriptor in an immensity of offered candidates. \u003cem\u003ePhys. Rev. Mater.\u003c/em\u003e \u003cstrong\u003e2\u003c/strong\u003e, 083802 (2018).\u003c/li\u003e\n\u003cli\u003eRastogi, R. P. et al. Bloom dynamics of cyanobacteria and their toxins: environmental health impacts and mitigation strategies. \u003cem\u003eFront. Microbiol.\u003c/em\u003e \u003cstrong\u003e6\u003c/strong\u003e, 1254 (2015).\u003c/li\u003e\n\u003cli\u003eBatterton, J. et al. Anilines selective toxicity to blue-green algae. \u003cem\u003eScience\u003c/em\u003e \u003cstrong\u003e199\u003c/strong\u003e, 1068-1070 (1978).\u003c/li\u003e\n\u003cli\u003eJančula, D. et al. Critical review of actually available chemical compounds for prevention and management of cyanobacterial blooms. \u003cem\u003eChemosphere\u003c/em\u003e \u003cstrong\u003e85\u003c/strong\u003e, 1415-1422 (2011).\u003c/li\u003e\n\u003cli\u003eNakai, S. et al. \u003cem\u003eMyriophyllum spicatum\u003c/em\u003e-released allelopathic polyphenols inhibiting growth of blue-green algae Microcystis aeruginosa. \u003cem\u003eWater Res.\u003c/em\u003e \u003cstrong\u003e34\u003c/strong\u003e, 3026-3032 (2000).\u003c/li\u003e\n\u003cli\u003eWei, P. et al. Efficient inhibition of cyanobacteria M. aeruginosa growth using commercial food-grade fumaric acid. \u003cem\u003eChemosphere\u003c/em\u003e \u003cstrong\u003e301\u003c/strong\u003e, 134659-134666 (2022).\u003c/li\u003e\n\u003cli\u003eYilimulati, M. et al. Regulation of photosynthesis in bloom-forming cyanobacteria with the simplest \u0026beta;-diketone. \u003cem\u003eEnviron. Sci. Technol.\u003c/em\u003e \u003cstrong\u003e55\u003c/strong\u003e, 14173-14184 (2021).\u003c/li\u003e\n\u003cli\u003eGonz\u0026aacute;lez, A. et al. 
Pivotal role of iron in the regulation of cyanobacterial electron transport. \u003cem\u003eAdv. Microb. Physiol.\u003c/em\u003e\u003cem\u003e \u003c/em\u003e\u003cstrong\u003e68\u003c/strong\u003e, 169-217 (2016).\u003c/li\u003e\n\u003cli\u003eGao, H. et al. The diversity and applications of microbial iron metabolism and iron-containing proteins. \u003cem\u003eCommun. Biol.\u003c/em\u003e \u003cstrong\u003e8\u003c/strong\u003e,177-179(2025).\u003c/li\u003e\n\u003cli\u003eSchalk, I. J. Bacterial siderophores: diversity, uptake pathways and applications. \u003cem\u003eNat. Rev. Microbiol.\u003c/em\u003e \u003cstrong\u003e23\u003c/strong\u003e, 24-40 (2024).\u003c/li\u003e\n\u003cli\u003eMorris, G. M. et al. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. \u003cem\u003eJ. Comput. Chem\u003c/em\u003e. \u003cstrong\u003e30\u003c/strong\u003e, 2785-2791 (2009).\u003c/li\u003e\n\u003cli\u003eRusso, D. P. et al. Nonanimal models for acute toxicity evaluations: applying data-driven profiling and read-across. Environ. Health Perspect. \u003cstrong\u003e127\u003c/strong\u003e, 47001 (2019).\u003c/li\u003e\n\u003cli\u003eKim, S. et al. PubChem 2025 update. \u003cem\u003eNucleic Acids Res\u003c/em\u003e.\u003cem\u003e \u003c/em\u003e\u003cstrong\u003e53\u003c/strong\u003e, 1516-1525 (2025).\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"State Key Laboratory of Coordination Chemistry, School of Chemistry and Chemical Engineering, Collaborative Innovation Center of Advanced Microstructures, Nanjing University, Nanjing 210023, P. R. 
China.","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Toxicology, Computational Chemistry, Fragment-Based Machine Learning, DFT-Derived Descriptors, Interdisciplinary risk assessment","lastPublishedDoi":"10.21203/rs.3.rs-7617003/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7617003/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eDeveloping environmentally safe and sustainable chemicals is essential for maintaining the ecological integrity. Here we introduced a new molecular representation, atomic charge enhanced Fragment Dictionary (eFragD), to quantify fragment-level contributions to molecular biological effects, including acute toxicity and cyanocidal activity. Unlike conventional fragment-based approaches, eFragD incorporates density functional theory (DFT)-derived atomic charge information to enhance structural sensitivity, improving predictive specificity and interpretability. Applied to a dataset of 7,804 compounds, the framework identified 19 high-risk and 18 eco-compatible fragments, which were validated on 1,400 external chemicals, including antibiotics, new compounds and organic ligands for hybrid perovskite materials. This approach enables early-stage chemical screening for both biological safety and ecological compatibility, supporting sustainable materials design. 
Notably, in data-scarce scenarios, eFragD effectively prioritized low-toxicity cyanocides, demonstrating its utility in balancing environmental efficacy with long-term sustainability. This work bridges computational toxicology and sustainability science, providing a scalable framework for green chemistry, water quality protection, and evidence-based chemical regulation.\u003c/p\u003e","manuscriptTitle":"A transferable fragment dictionary for eco-safety and environmental risk assessment","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-09-16 10:38:43","doi":"10.21203/rs.3.rs-7617003/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"8ae5ba1e-163c-4c16-88aa-6839aeb57dfb","owner":[],"postedDate":"September 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":54709859,"name":"Toxicology"},{"id":54709860,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-09-16T10:38:44+00:00","versionOfRecord":[],"versionCreatedAt":"2025-09-16 10:38:43","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7617003","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7617003","identity":"rs-7617003","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
