Molecular-substructure Deep Autoencoders Cluster Biomolecules into Novel Band-Shaped Substructure-Distinguished Bioactivity Clusters in 3D Latent Space

Research Square
Article

YING TAN, Huazhang Ying, Xiang Wu, Chu Qin, Likun Zhang, Zhicheng Du, and 2 more

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-6755378/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1 posted

Abstract

Unsupervised deep autoencoders (DAEs) are useful for data clustering and visualization. DAE-derived data clusters are typically visualized by dimensionality reduction methods, which have some degree of visual distortions that pose difficulties in revealing intrinsic cluster patterns.
Here, we developed substructure-based molecular-fingerprint DAEs (MolF-DAEs) to cluster 1.9 million bioactive molecules (biomolecules) in a 3D latent space (3DLSpace), where data clusters can be visualized directly. MolF-DAEs developed with three established sets of molecular fingerprints consistently cluster biomolecules with a 96.1–97.6% reconstruction rate. In 3DLSpace, the biomolecules form novel band-shaped clusters that are distinguished by substructure and relevant to bioactivity. Each cluster is dominated by biomolecules with specific substructure combinations. These in-cluster biomolecules vary in molecular structure but frequently fall into a limited number of bioactivity classes. Our study suggests that unsupervised deep clustering in 3DLSpace is useful for visually revealing intrinsic data distribution patterns and functionally relevant data clusters.

Keywords: Biological sciences/Chemical biology/Cheminformatics; Biological sciences/Chemical biology/Computational chemistry; Bioactive molecule; DAEs; substructure

Introduction

Unsupervised learning methods such as deep clustering have been widely applied in real-life statistical analysis such as pattern recognition [1] and image processing [2], and in knowledge discovery tasks such as bioactive molecule (biomolecule) clustering [3, 4], genomics data mining [5], and disease diagnosis [6]. One application of deep clustering is in drug discovery, where effective clustering of biomolecules with respect to common molecular determinants facilitates the mapping of pharmacological chemical space [7] and the investigation of structure-activity relationships [8]. Various clustering methods have been developed based on the fundamental data features or their linearly or non-linearly transformed variants [9], such as K-means [10, 11], hierarchical clustering [12] and spectral clustering.
Moreover, deep autoencoders (DAEs) are highly useful for deep and complex clustering tasks [13, 14]. Because data features are high-dimensional, sparse and variable, these clustering methods rely on feature representation and dimensionality reduction techniques [14, 15]. Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbor Embedding (t-SNE) [16] are the algorithms most commonly used as a preprocessing step to expose useful cluster patterns. With these methods, the visualization of data clusters and the subsequent analysis may be affected by visual distortion in the low-dimensional space. Even minor visual distortions may in some cases affect the quality of cluster analysis. For example, minor differences in biomolecular structures (i.e. minor differences in the separation of cluster neighbors) may lead to substantial changes in bioactivity targets and bioactivity values [17]. There is therefore a need for effective methods that both cluster data and visualize the undistorted cluster patterns.

DAEs [13, 14] may be explored for data clustering and undistorted visualization. They capture nonlinear relationships in complex patterns while preserving both local and global characteristics [18, 19]. To visualize DAE-derived data clusters without distortion, one may construct DAEs with a three-dimensional latent space, so that the clusters can be viewed directly in 3DLSpace without any further projection. The question is whether DAEs can meaningfully cluster data in 3DLSpace. Here we developed molecular fingerprint deep autoencoders (MolF-DAEs) to demonstrate the clustering ability of DAEs in 3DLSpace. We further revealed the DAE-derived cluster patterns of bioactive molecules and discussed their potential implications for drug discovery tasks.
MolF-DAEs consist of symmetric fully connected encoders and decoders, which were trained on 1.9 million biomolecules from the ChEMBL database [20] represented by three sets of molecular fingerprints (MFs). The three sets of DAEs are the PubChem molecular fingerprint model (PubChemFPM), the MACCS keys fingerprint model (MACCSFPM), and the 2D pharmacophore fingerprint model (PharmacoPFPM) [21] (Fig. 1). Unlike existing methods, MolF-DAEs do not require an additional dimensionality reduction step. Additionally, we developed Chempack, a chemical space navigation and simulation software package, for displaying and analyzing the band cluster landscapes. Rather than solving a specific downstream prediction task, MolF-DAEs aim to mine reliable, experimentally obtained high-activity data in order to evaluate the activity-related compound space and the organism target space. This method can be transferred to the mining of other types of sparse high-dimensional data.

Results

1. High Accuracy of Deep Autoencoders in Reconstructing Three Fingerprint Feature Maps

This work demonstrates the high efficacy of deep autoencoders in reconstructing biomolecule fingerprint feature maps (Fmaps). To increase efficiency, we focus on molecules with preliminary evidence of biological activity. In total, a dataset of 1.9 million highly bioactive molecules (potency values ≤10 μM) was collected from the ChEMBL database. Three deep autoencoders were constructed to encode three sets of molecular fingerprint features: PubChemFP and MACCSFP, which are based on SMILES arbitrary target specification (SMARTS) patterns, and PharmacoPFP, which is based on pharmacophores. The reconstruction rates for PubChemFPM, MACCSFPM, and PharmacoPFPM reach 97.55%, 96.10% and 97.55% respectively, surpassing the reported reconstruction rates of 95.3%-96.4% obtained with a 196-dimensional latent space in a variational autoencoder (VAE) trained on 250,000 drug-like molecules [4].
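As a hedged illustration of what a position-to-position reconstruction rate measures, the sketch below compares binary fingerprints with decoder outputs bit by bit. It assumes the decoder emits per-bit probabilities that are binarized at 0.5; the function name, the threshold and the toy data are assumptions for illustration and are not taken from the MolF-DAE code.

```python
from typing import Sequence


def reconstruction_rate(
    original: Sequence[Sequence[int]],
    decoded: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> float:
    """Percent of bit positions where the thresholded output equals the input."""
    matches = 0
    total = 0
    for bits, probs in zip(original, decoded):
        if len(bits) != len(probs):
            raise ValueError("fingerprint and reconstruction lengths differ")
        for bit, prob in zip(bits, probs):
            # Assumption: a decoder output >= threshold counts as a set bit.
            predicted = 1 if prob >= threshold else 0
            matches += int(predicted == bit)
            total += 1
    if total == 0:
        raise ValueError("no fingerprint bits to compare")
    return 100.0 * matches / total


# Toy data: two 4-bit fingerprints and hypothetical sigmoid outputs.
original = [[1, 0, 1, 1], [0, 0, 1, 0]]
decoded = [[0.9, 0.2, 0.4, 0.8], [0.1, 0.3, 0.7, 0.6]]
rate = reconstruction_rate(original, decoded)
print(f"{rate:.2f}%")
```

In the toy data, one bit is wrong in each fingerprint, so six of the eight positions agree.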
These reconstruction rates indicate the precision of the models in reconstructing two-dimensional Fmaps, and thereby establish their reliability in interpreting molecular physicochemical properties, structural fragments, and spatial distributions. Taking the PubChemFP model as an example, the loss stabilizes around 0.0245 after 100 training epochs (Fig. 2a). Although model performance correlates with parameter count, the minimum MSE does not necessarily require the maximum number of parameters (Fig. 2b). Compared with SMILES, Fmaps distinguish molecules more clearly. As the MSE decreases, the visual improvement in the reconstructed Fmaps becomes apparent (Fig. 2c, Supplementary Figs. 1-6). Although a low-dimensional latent vector makes it difficult to reconstruct fingerprints accurately [22], most of the original data characteristics are preserved after reconstruction, barring slight deviations in some local features. This is attributed to the extensive training data and the feature representation methodology.

2. DAEs built on three sets of molecular fingerprints exhibit band-shaped clustering patterns in 3DLSpace

It is intriguing to examine the distribution landscape in latent space based on substructures and pharmacophores. Overall, these distributions are orderly and highly consistent with human knowledge. MolF-DAEs reach their optimal distribution in 3DLSpace as training progresses (Fig. 3a-c). The molecules show a discontinuous spatial distribution with distinct boundaries. Internally, molecules are arranged into bands, each of which originates from a common central region. This arrangement has been observed with nonlinear algorithms such as UMAP, where it represents mappings dominated by the strong effects of major gradient features [23, 24]. Here it is observed for the first time in DAEs. The latent space exhibits distinct, island-like regions: dense clusters in which specific molecular substructures or pharmacophore labels are uniquely enriched.
These regions emerge as training progresses, separating from the general molecular distribution to form cohesive zones, each representing variations of a common molecular feature. Target type is a molecular characteristic that has received widespread attention [25]. It has been reported that over 50% of drug design targets are concentrated in four categories: kinases, proteases, nuclear receptors and G-protein coupled receptors (GPCRs) [26]. However, each of these target families covers only 1.45%-6.42% of the experimentally verified biomolecules in the 1.9 million compound database. This indicates that the space of biomolecules remains vast. In MolF-DAEs, without any prior target information, these targets naturally appear in relatively isolated regions (Fig. 3d-g). For example, in some of these regions the concentrations of specific targets reach 71% for kinase, 71% for protease, and 54% for GPCR. MolF-DAEs are therefore highly effective at identifying target-specific clusters.

Within the PubChemFPM, kinase inhibitors are concentrated in long islands in the upper region, while protease inhibitors are concentrated in the lower region. GPCRs have fewer known drugs and more diverse natural ligands [27]. GPCR ligands appear dispersed and are concentrated in short islands (Fig. 3d, Supplementary Fig. 7). The three molecular fingerprint features most commonly reported in the literature are used (Supplementary Figs. 7-9), and each exhibits a unique clustering pattern. PubChemFPM captures the substructural features of molecules and yields target-specific bands with the clearest separation. In contrast, MACCSFPM mainly reflects the overall structure of molecules, and its clusters are closely connected. PharmacoPFPM mainly describes the position, spatial relationship, and interaction pattern of pharmacophores in molecules. It exhibits many distinctly isolated short islands in addition to the band structures.
The performance of downstream tasks is related to factors such as fingerprint type, dimension, compression, and redundancy [22].

3. Bands exhibit significant differences in the potential biotherapeutic targets within the 3DLSpace

The in-cluster biomolecules vary in molecular structure but frequently fall into a limited number of bioactivity classes. Significant concentrations are observed at the level of substructures, physical properties, and target families. For subsequent analyses and applications, we manually selected six representative bands with clear boundaries that span the compound space, covering all fingerprint channels (bands 1-4: PubChemFPM, band 5: MACCSFPM, band 6: PharmacoPFPM) and the privileged target islands (bands 1-2 and 5-6: kinase, band 3: protease, band 4: GPCR). Although there is a fair number of ligands of other target types (gray), the four major target classes account for up to 34% of points (Fig. 4a and Supplementary Data 8).

Kinases are a class of targets with relatively conserved binding pockets, which causes widespread off-target effects. Kinase inhibitors account for 69.22%-84.01% of bands 1-2 and 5-6 (Fig. 4a), and significant differences appear among families (Fig. 4b) and groups (Fig. 4c). In the four kinase-enriched bands, members of the Janus kinase (JakA) and receptor tyrosine kinase (RTK) group, such as the epidermal growth factor receptor (EGFR) and fibroblast growth factor receptor (FGFR) families, exhibit the highest proportions. The Tec family is uniquely enriched in PubChemFPM band 1. EGFR inhibitors such as Erlotinib and Gefitinib, developed as anticancer therapeutics, consist mainly of rings as a basic skeleton [28, 29]. They are enriched within the kinase cluster and positioned close to each other, at a distance of 25.42. The top 10 families enriched in PharmacoPFPM are more concentrated and lie closer together on the inter-group and evolutionary trees (Supplementary Fig. 11).
Evolutionarily closely related kinase families tend to have similar core structures, which are likely to cause cross-effects. Conversely, the PubChemFP band covers a broader range of families, including newly enriched inhibitor groups for which fewer inhibitor studies are known, such as Tyrosine kinase-like (TKL), Calcium/calmodulin-dependent protein kinase (CAMK) and the remaining groups [27]. The degree of enrichment varies significantly among these remaining groups. PharmacoPFP band 5 enriches 30.61% of the CAMK group, on which little research has been reported [30, 31]. This indicates that the model sensitively captures relevant structures with similar activities in undistorted clusters, which contributes to research on inhibitor patterns for novel targets.

4. Combinations of core substructures explain the undistorted band-shaped cluster organization

Within the samples there is unity of core structure and diversity of local residues, suggesting a high-quality, undistorted distribution pattern. Privileged target types exist, but the principle behind target-label clustering is limited to Fmap-related substructural classes. GPCR binding fragments share a conserved structural scaffold, whereas kinase inhibitors exhibit more variation in scaffold and substituents. However, distinct GPCR islands with clear boundaries and structural similarities were identified in kinase-enriched band 1 (Fig. 4d). In band 1, 95.21% of molecules, including those in the GPCR islands, share the hydrogen bond site N-C-N-C, whereas other GPCR-enriched bands seldom contain this substructure (3.23%). In another kinase-enriched band, the N-C-N-C-C-N structure is more than twice as prevalent as in band 1. Molecules in the kinase island typically exhibit higher LogP, while both the molecular weight and the LogP of biomolecules in the GPCR islands are lower. Thus, MolF-DAEs separate bands on the basis of various substructural or pharmacophore features, which further explains the intrinsic reasons for the cluster-target relation.
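Substructure shares of the kind quoted in this section are prevalence counts, and comparing a band against a background gives a fold enrichment. The sketch below shows that arithmetic on toy data. It assumes each molecule has already been reduced to a set of substructure labels; the labels, helper names and toy molecules are hypothetical and do not come from the paper's data.

```python
from typing import FrozenSet, Sequence

Molecule = FrozenSet[str]  # hypothetical: a molecule as a set of substructure labels


def prevalence(molecules: Sequence[Molecule], substructure: str) -> float:
    """Fraction of molecules that contain the given substructure label."""
    if not molecules:
        raise ValueError("empty molecule collection")
    return sum(substructure in mol for mol in molecules) / len(molecules)


def fold_enrichment(
    band: Sequence[Molecule], background: Sequence[Molecule], substructure: str
) -> float:
    """Prevalence in the band divided by prevalence in the background."""
    base = prevalence(background, substructure)
    if base == 0:
        raise ValueError("substructure absent from background")
    return prevalence(band, substructure) / base


# Toy band: 3 of 4 molecules carry the N-C-N-C element.
band = [
    frozenset({"N-C-N-C", "ring6"}),
    frozenset({"N-C-N-C"}),
    frozenset({"N-C-N-C", "ring5"}),
    frozenset({"ring6"}),
]
# Toy background: the band plus 4 molecules without the element (3 of 8).
background = band + [frozenset({"ring5"})] * 4

band_share = prevalence(band, "N-C-N-C")
fold = fold_enrichment(band, background, "N-C-N-C")
print(band_share, fold)
```

With real fingerprints, the membership test would be a lookup of the corresponding fingerprint bit rather than a string label.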
Furthermore, this data-driven clustering contributes to drug novelty evaluation. Molecules containing amide bonds (-CONH-), which participate in hydrogen bonding, and methoxy-substituted phenyl rings are enriched in kinase island 1. This structure is common in various drugs, including anticancer, antibacterial, and anti-inflammatory drugs like Amitriptyline. Methoxy-substituted phenyl rings are relatively easy to introduce in organic synthesis, and methoxy groups increase lipophilicity, affecting drug metabolism and absorption. In kinase island 2, more substituents such as methoxy (-OCH3), fluorine (-F), and chlorine (-Cl) are present. Notably, the gray biomolecules in these islands are highly likely to be new kinase-targeting biomolecules.

To explain the observed undistorted clustering, we summarize the common functional structural modes, which provide a basis for binding studies. Structural analysis reveals a widespread substructural pattern. Each band cluster primarily consists of molecules with unique scaffold or substructure combinations, with minimal overlap with other clusters. The kinase-specific substructure mode combines two elements: core hydrophobic ring elements (blue) and core hydrogen site elements (red) (Fig. 5a). These substructures account for less than 15% of the compounds in other target bands (Fig. 5c, Supplementary Fig. 12b). The linear structures serve as hydrogen donors and acceptors and are decisive for the overall binding strength and positioning. Two kinds of linear structure are observed: the linear framework and its variants, and the Y-shaped framework and its variants. Elements such as Y-shaped N-O-N, N-S-N and N-O-CN or L-shaped NC-N-CN, along with core hydrophobic ring elements, are observed in Lapatinib, Imatinib and Sorafenib. Hydrophobic ring elements contain 6-membered or 5-membered carbon rings, which are important for hydrophobicity and aid overall stabilization.
The ring elements are connected to the core elements by one or more carbon chains. The question is which molecular features are crucial for selective kinase targeting and potency. To compare against the overall background, we counted the frequency of occurrence of these substructures in all 1.9 million molecules and in each band. In kinase-specific inhibitor bands, the core element combination is enriched up to 22.01-fold over background (Fig. 5b, Supplementary Fig. 9). This indicates that MolF-DAEs strongly select this bioactive substructure as an important feature. Overall, in PubChemFPM regions, the linear carbon-nitrogen substructure (NC-N-C) appears in almost all known kinase inhibitors (band 1: 100%, band 2: 99.8%) and in most other molecules (band 1: 95.2%, band 2: 95.1%). Most of the kinase inhibitors are also connected to at least one six-membered ring (band 1: 62.2%, band 2: 40.4%) or five-membered ring (band 1: 43.4%, band 2: 37.1%). MACCSFPM covers fewer chemical substructures than PubChemFPM, indicating a more concentrated distribution. There is no Y-shaped substructure connected to six-membered rings in band 5. The pharmacophore fingerprint primarily describes the position, spatial relationships, and interaction patterns of pharmacophores in molecules. The clustering of PharmacoPFPM in 3DLSpace is more pronounced, with relatively high proportions of various feature substructures.

Protease is another target type that occupies a clearly independent band. Peptide chains are prevalent, with variations in carbon-chain length and element substitutions. Protease inhibitors compete with natural peptide substrates while not being degraded by proteases. Some unidentified ligands with long peptide chains are captured in the band, such as CHEMBL100202 and CHEMBL102898. In terms of property labeling, this band is at variance with the reported distribution of the overall physicochemical properties of approved drugs [32].
Similar to peptide drugs, protease-privileged bands generally have high molecular weights, lower logarithmic partition coefficient (LogP) values (indicating greater hydrophilicity), and higher topological polar surface area (TPSA). For example, 27% of the molecules exhibit a LogP below 0, a significantly higher share than the average for approved drugs, whose LogP values concentrate around 1-3. The amide bonding increases the polarity of the molecule and the number of hydrogen bond donors and acceptors. Molecules with these differences in structure and physicochemical properties tend to have lower quantitative estimate of drug-likeness (QED) values, averaging around 0.1, well below the reported average of 0.35. This underscores the unique structural and drug-forming properties of this cluster [32]. However, these features, while conducive to protease activity, pose additional pharmacokinetic challenges for molecules in this band, similar to those faced by peptide drugs.

There are GPCR-concentrated blocks in band 1 and band 4, even though the number of GPCR compounds is less than half that of kinase and protease compounds. In contrast to the conserved kinase pocket, the natural substrates of GPCRs are more mixed, including both nucleic acid substrates and peptide substrates. In 3DLSpace, GPCR-specific clusters mostly take the form of short islands, with more dispersed molecular clustering. Some are distributed within other target clusters, implying that they have similar functional groups or competing substrates. However, a small number of independent long-island GPCR bands still exist in band 1. Structures within islands are often very similar in the specific lengths of their core hydrogen bonding elements (Y-shaped NCO, NCN, NCCO), combined with core hydrophobic ring elements (R and RN) [33]. Islands differ mainly in substituents, such as O/N-rich islands 1 and 2, and F/Cl-rich island 2.
GPCRs have binding pockets that vary significantly in nature: F provides strong lipophilicity that may be related to binding to the water transport pocket in the transmembrane region, while O is closely associated with the extracellular region.

5. Relevance and distinction of sub-structural features with respect to the literature-reported privileged pharmacophores and drug-binding modes

This raises the question of how relevant the DAE-captured substructural features of an individual band cluster are to the bioactivities selected in that cluster. Compared with literature-reported kinase structures, the enriched categories are highly consistent with those reported [34]. Some substructural features of the band clusters comprise key frameworks in literature-reported kinase-binding modes of kinase inhibitor drugs, pharmacophores of kinase frequent hitters, and privileged fragments of kinase inhibitors. The L- and Y-shaped regions usually bind stably to the hinge area. Core hydrogen bonding structures of different lengths offer a choice of four sites, of which 2 to 3 bind to the hinge region or gatekeepers. Hydrophobic rings, which are larger and segmented into two parts, are more likely to extend into the E0 or back pocket to form stacking interactions. Different regions with unique substructure patterns are selective in their binding conformation. Structural differences in the remaining parts allow binding to pockets of different conformations [35]. The reported kinase-binding conformations for band 2, with short or Y-shaped hydrogen sites (PDB: 9JI, 3JW, GD9, P06, 6S1, and RXT), lie in the front pocket and do not reach the back pocket. Complexes with multiple long-chain hydrogen sites in band 1 (PDB: TZ0, R1L and UCW) occupy both the front and back pockets (Fig. 6). The substructural characterization provided by MolF-DAEs is closely consistent with the framework of the selected bioactivities and the concentration of these biomolecules within the bands.
This correlation is in line with the finding that kinase-specific fragments can enhance kinase inhibitors by 5-fold [36], with substructures within individual bands exhibiting enrichments exceeding 25-fold. It also echoes reports that certain molecular scaffolds can exhibit activity against multiple target classes. However, the substructural elements captured by MolF-DAEs diverge in one respect from the pharmacophoric and privileged fragments outlined in the literature. MolF-DAEs capture the fundamental substructural elements and their structural variations, which collectively define the framework of pharmacophoric or privileged fragments, whereas kinase-specific structures documented in the literature often manifest as specific combinations of fragments, such as bis-aryl-NH-linked fragments and biphenyl ether scaffolds [37]. Consequently, by assimilating the foundational elements of structural frameworks, MolF-DAEs can capture a broader spectrum of pharmacophoric and conventionally defined specific structures, and cluster them into individual band clusters.

Discussion

DAEs are a promising deep learning-based strategy for undistorted presentation and clustering. DAEs perform well in prediction on high-dimensional datasets [38] and have successfully tackled challenging tasks with millions of training samples [14, 39, 40]. In contrast to nonlinear methods that rely on a fixed kernel function, autoencoders are learned by optimizing the reconstruction error [41]. Reconstruction quality reflects how well the latent space represents the data. For highly complex data such as drugs, it seems difficult for a DAE to reduce directly to 3D space because information loss is inevitable. Thus, data such as MNIST are currently reduced from 28×28×1 images only to 128 dimensions. Effort has focused on joint methods built on DAEs to handle the space distortion challenge [42, 43].
Visualization then depends on additional dimensionality reduction methods. The DAE presented here is an undistorted clustering method that handles high-dimensional datasets without an additional clustering strategy. In contrast, on the 1.9 million biomolecule dataset, reducing the 128-dimensional latent space of a DAE to 3 dimensions with traditional dimensionality reduction methods (UMAP and PCA) performs badly. Molecules exhibit a typical spherical distribution in 3D space, and the result lacks the target and substructure separation seen with MolF-DAEs (Supplementary Fig. 21).

In biological experiments, research on biomolecules typically involves dozens or hundreds of data points. Supervised learning research on biomolecules relies on tens of thousands of high-quality target or disease data points. Unsupervised learning overcomes these limitations of data quality. This work demonstrates significant improvements in both the size of the training dataset and the systematic use of physicochemical property feature dimensions. Surprisingly, all three sets of molecular fingerprints exhibit, for various target types, a characteristic radial distribution emanating from the origin. They exhibit target-specific clusters that in turn reflect a more fundamental classification based on molecular structure. This has potential implications for biomolecule classification and research. The substructure information obtained from the model is highly consistent with human knowledge, which makes it possible to subdivide the biomolecules into refined subclasses and offers crucial clues for understanding the relationship between the structure and activity of drugs. Meanwhile, MolF-DAEs make it possible to explore chemical space more impartially by acquiring a meaningful latent space, free from human preconceptions or bias.
Finally, this work provides interpretability for the clustering distribution of unsupervised models, while also aiding tasks such as understanding target-related drug structures, identifying potential drug candidates, and facilitating drug repurposing. MolF-DAEs offer several possible directions of application. From the drug perspective, the conserved structures of a specific collection of molecules can be used for ligand-based virtual screening to discover new drugs for specific targets, to evaluate the novelty of bioactive skeletons and the diversity of substituents, and to guide the generation of new structures. From the target perspective, the biomolecule structures and drug diversity of a target can be evaluated, the potential off-targets of a compound list can be predicted, and drug cross-reactivity across multiple proteins can be assessed. In the field of data analysis, compounds are a data type that is rich in structural information and has high-dimensional feature representations. In the future, this approach could be transferred to other knowledge-related types of sparse high-dimensional data for non-destructive spatial clustering, display and data mining.

Method

Data collection and labeling. Biomolecules of medium to high activity, with IC50, EC50 or Ki ≤ 10 µM as determined by experimental methods such as the MTT assay and kinase activity tests, were selected as the dataset from the ChEMBL pharmacochemical database [44, 45]. The dataset covers compounds from the preclinical to the approved stage and contains 1,943,048 biomolecules in total. Labels for the verified targets are used for visualization in 3DLSpace. Molecules with activity against the four major drug target classes (kinase, protease, GPCR and nuclear receptor) were queried. This yielded 124,632 biomolecules targeting kinases (red), 92,040 targeting proteases (green), 33,718 targeting GPCRs (cyan) and 28,216 targeting nuclear receptors (purple).
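The class counts above can be turned into shares of the full dataset with a few lines of arithmetic. The sketch below uses only the counts stated in this section; the variable names are illustrative. The resulting shares, from about 1.45% for nuclear receptors to about 6.41% for kinases, are close to the per-class coverage range quoted in the Results.

```python
# Counts as stated in the data collection paragraph.
TOTAL_BIOMOLECULES = 1_943_048
class_counts = {
    "kinase": 124_632,
    "protease": 92_040,
    "GPCR": 33_718,
    "nuclear receptor": 28_216,
}

# Share of the whole dataset covered by each major target class, in percent.
coverage = {
    name: 100.0 * count / TOTAL_BIOMOLECULES
    for name, count in class_counts.items()
}
labelled_total = sum(class_counts.values())

for name, share in coverage.items():
    print(f"{name}: {share:.2f}%")
print(f"four classes combined: {labelled_total} molecules")
```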
Compounds with multiple target labels were excluded during 3D visualization. A further 1,658,503 biomolecules target other types of targets.

Construction of molecular fingerprint Fmaps. We selected the three molecular fingerprints with the highest citation counts among 1410 publications up to 2024, according to the PubMed database: two substructure-key SMARTS-based features (PubChemFP, 192 bits and MACCSFP, 476 bits) and one pharmacophore-based feature (PharmacoPFP, 298 bits) [46]. These fingerprints are generated with MolMap, a new method of molecular feature generation based on manifold learning [21, 47]. We use PyBioMed to remove the infrequently set bits of the one-hot-encoded PubChem molecular fingerprint, reducing the original 881 dimensions to 733 dimensions [48].

Optimization of the parameters of MolF-DAEs. A pair of complementary DNNs is adopted: an encoder that acts as an extractor, converting Fmaps into 3DLSpace for clustering the molecules, and a decoder that converts the latent codes back into the original Fmaps for optimizing the autoencoder. The Adam optimizer is adopted, with binary cross-entropy and then MSE used successively as the loss function over the planned 100 training epochs. The trend of the loss function during training is shown for binary cross-entropy (left) and MSE (right). The reconstruction loss is

$$L_{rec} = \min \frac{1}{n}\sum_{i=1}^{n} \left\| x_{i} - \varphi_{r}\left(\varphi_{e}\left(x_{i}\right)\right) \right\|^{2}$$

where \(\varphi_{e}(\cdot)\) and \(\varphi_{r}(\cdot)\) represent the encoder network and decoder network of MolF-DAE, respectively. The hyperparameters of the autoencoder were optimized by the tree-structured estimator approach in two phases [49]. In phase 1, the number of layers was varied from 2 to 100, and the number of nodes in the first hidden layer was set at 2048, followed by reductions of up to 50% in each subsequent layer.
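The phase 1 search space described above can be pictured as a family of shrinking layer widths. The sketch below generates one such candidate: it starts at 2048 nodes and shrinks each following hidden layer by a fixed fraction of at most 50%, then mirrors the encoder to obtain a symmetric decoder. The function names, the fixed per-layer fraction and the minimum width are assumptions for illustration; the actual search procedure and the selected widths are those reported by the authors in Supplementary Table 10.

```python
from typing import List


def hidden_widths(
    first: int = 2048, n_hidden: int = 5, reduction: float = 0.5, minimum: int = 4
) -> List[int]:
    """Widths of successive hidden layers, each shrunk by `reduction`."""
    if not 0.0 <= reduction <= 0.5:
        raise ValueError("reduction must lie between 0 and 0.5 (up to 50%)")
    widths = [first]
    for _ in range(n_hidden - 1):
        # Assumption: never shrink below `minimum` nodes before the latent layer.
        widths.append(max(minimum, int(widths[-1] * (1.0 - reduction))))
    return widths


def encoder_layout(input_dim: int, hidden: List[int], latent: int = 3) -> List[int]:
    """Layer sizes from the input Fmap down to the 3-node latent layer."""
    return [input_dim] + list(hidden) + [latent]


hidden = hidden_widths(first=2048, n_hidden=5, reduction=0.5)
encoder = encoder_layout(733, hidden)  # 733 = reduced PubChemFP dimension
decoder = encoder[::-1]                # symmetric decoder mirrors the encoder
print(encoder)
print(decoder)
```

A hyperparameter search would draw many such candidates with different depths and reduction fractions and keep those that meet the reconstruction target.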
Phase 1 proceeds until the reconstruction rate (the percentage of position-to-position agreement with the original molecular fingerprints) exceeds 95%, which is comparable to the reported 95.3%-96.4% reconstruction rate of an autoencoder trained on 250,000 drug-like molecules [4]. In phase 2, the number of nodes in the hidden layers was fine-tuned until the reconstruction rate reached its optimal value. The optimal parameters are detailed in Supplementary Table 10. In PubChemFPM, the encoder is a 5-layer fully connected neural network (DNN). In the encoder, each dense layer is followed by a ReLU activation function, and the final layer connects to a fully connected layer with 3 nodes. Each dense layer in the decoder is followed by a ReLU activation function, except for the last dense layer, which uses a sigmoid activation function. The total parameter count is 2,697,564, approximately 2.7 million.

In terms of individual image size, a MACCSFP Fmap is only a quarter of the size of a PubChemFP Fmap. Consequently, models with the original parameters have too many total parameters for the new dataset, leading to suboptimal Fmap reconstruction. Models with fewer total parameters were therefore adopted, and the parameter space was gradually expanded to find the optimal parameters. During optimization, a total parameter count of 2 million yielded better training results, and the composition of the 2-million-parameter network was adjusted continuously. The model is unstable in the task of reconstructing images from large datasets, although the variation of the stabilized MSE is small and within an acceptable range. Therefore, the choice between these parameters has minimal impact on the final model selection. The influence of network depth, encoder structure, and total parameter count on the reconstruction effect outweighs the impact of the model's inherent instability.
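Total parameter counts such as the one quoted above follow directly from the layer widths of a fully connected network: each dense layer contributes one weight per input-output pair plus one bias per output. The sketch below counts parameters for a symmetric autoencoder. The helper names and the hidden widths in the usage example are hypothetical, so the printed total is not expected to match the reported 2,697,564.

```python
from typing import Sequence


def dense_params(widths: Sequence[int]) -> int:
    """Weights plus biases for a chain of dense layers with these widths."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths, widths[1:]))


def autoencoder_params(input_dim: int, hidden: Sequence[int], latent: int = 3) -> int:
    """Parameter count of a symmetric fully connected autoencoder."""
    encoder = [input_dim, *hidden, latent]
    decoder = encoder[::-1]  # mirrored layer sizes, separate weights
    return dense_params(encoder) + dense_params(decoder)


# Tiny case that is easy to verify by hand: 4 -> 2 -> 1 -> 2 -> 4.
tiny_total = autoencoder_params(4, [2], latent=1)

# Hypothetical hidden widths for a 733-bit input with a 3-node latent layer.
example_total = autoencoder_params(733, [1024, 512, 128, 32])
print(tiny_total, example_total)
```

Because the first and last layers dominate the count, models for smaller Fmaps need far fewer parameters at the same depth, which is consistent with the smaller networks adopted for MACCSFP.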
Chempack Software Demonstration

Accessing spatial data linearly makes it difficult to answer spatial queries quickly during data retrieval and to conduct statistical analyses based on the macroscopic distribution of data in space. To address these issues, we developed the Chempack software for fast navigation and simultaneous display of the DAE-generated distribution landscapes of up to 50 million molecules in the 3-dimensional latent chemical space. The molecules within an on-screen subspace (a movable cubic box) are displayed as bright spheres (grey by default) against a black background. Subsets of molecules may be highlighted in user-specified colors (via a color selector). Molecules outside the movable cubic box are not displayed until the box is moved into their local subspace. A multi-layer iterative data retrieval and display algorithm was employed to display the spheres within a local cubic box whose volume is \(1/4^{N-1}\) (N = 1–8) of the volume of the global cubic box defined by the input spheres. The following procedure created the global and local cubic box. A user manual for Chempack is provided in the supplementary materials.

In this paper, we use Chempack to visualize 3DLSpace: the three latent values of each molecule are input as its coordinates, together with the labels previously annotated for the 1.9 million biomolecules. Different labels are shown in different colors, with one color per target type: kinase (red), protease (green), nuclear receptor (purple), G protein-coupled receptor (GPCR) (cyan), and other targets (gray).
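The box-based display described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Chempack's actual algorithm or API: all function names are hypothetical, the garbled volume ratio is read as 1/4^(N-1), and the multi-layer iterative retrieval is replaced by a plain linear filter.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]

# Color scheme stated in the text; unlabeled molecules default to gray.
TARGET_COLORS = {
    "kinase": "red",
    "protease": "green",
    "nuclear receptor": "purple",
    "GPCR": "cyan",
    "other": "gray",
}


def color_for(label: str) -> str:
    """Return the display color for a target-type label."""
    return TARGET_COLORS.get(label, "gray")


def global_cube(points: Sequence[Point]) -> Tuple[Point, float]:
    """Center and side length of the smallest axis-aligned cube
    enclosing all input points."""
    mins = [min(p[k] for p in points) for k in range(3)]
    maxs = [max(p[k] for p in points) for k in range(3)]
    side = max(hi - lo for lo, hi in zip(mins, maxs))
    center = tuple((lo + hi) / 2 for lo, hi in zip(mins, maxs))
    return center, side


def local_side(global_side: float, level: int) -> float:
    """Side of a local cube whose volume is 1/4**(level-1) of the
    global cube (assumed reading of the ratio in the text)."""
    if not 1 <= level <= 8:
        raise ValueError("level must be between 1 and 8")
    return global_side * (4 ** (level - 1)) ** (-1.0 / 3.0)


def points_in_box(points: Sequence[Point], box_center: Point,
                  side: float) -> List[Point]:
    """Points inside the movable cubic box; only these are displayed."""
    half = side / 2
    return [p for p in points
            if all(abs(p[k] - box_center[k]) <= half for k in range(3))]
```

A viewer built on this sketch would recompute `points_in_box` whenever the box moves or the level changes. For tens of millions of molecules, a spatial index such as an octree would be needed in place of the linear scan.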
Declarations

Acknowledgments
The authors acknowledge financial support from the National Key Research and Development Program of China (grant number: 2023YFA0913600) and the Research Fund of Guangdong Province (grant number: 2024A1515011906).

Code availability
All code used in this paper is open source. Code for MolF-DAEs is available via GitHub at https://github.com/HuazhangYing/MolF-DAEs. Code for feature representation is available via GitHub at https://github.com/shenwanxiang/bidd-molmap. The dataset of 1.9 million bioactive molecules is available at https://drive.google.com/drive/folders/1oPpz3biogeAEoa9-RydxdLDBcZQd-6o6?usp=drive_link. Resources for the 3D visualization software Chempack are openly accessible. For access or further information, please contact the authors.

Author contributions
Y.T., Y.C., and C.Q. conceived the project. H.Y. and X.W. developed the methodology and performed the experiments and analysis. H.Y. wrote the paper. H.Y., Y.C. and Y.T. contributed to the revision of the manuscript.

Competing Interests statement
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References
1. Oyedotun, O. & Dimililer, K. Pattern Recognition: Invariance Learning in Convolutional Auto Encoder Network. International Journal of Image, Graphics and Signal Processing 8, 19-27 (2016). https://doi.org/10.5815/ijigsp.2016.03.03
2. Dundar, A., Jin, J. & Culurciello, E. Convolutional Clustering for Unsupervised Learning. ArXiv abs/1511.06241 (2015).
3. Polanski, J. Unsupervised Learning in Drug Design from Self-Organization to Deep Chemistry. International Journal of Molecular Sciences 23, 2797 (2022).
4. Gómez-Bombarelli, R. et al. Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. ACS Central Science 4, 268-276 (2018). https://doi.org/10.1021/acscentsci.7b00572
5. Clarke, R. et al. The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. Nature Reviews Cancer 8, 37-49 (2008). https://doi.org/10.1038/nrc2294
6. Chen, L., Saykin, A. J., Yao, B. & Zhao, F. Multi-task deep autoencoder to predict Alzheimer’s disease progression using temporal DNA methylation data in peripheral blood. Computational and Structural Biotechnology Journal 20, 5761-5774 (2022). https://doi.org/10.1016/j.csbj.2022.10.016
7. Mullowney, M. W. et al. Artificial intelligence for natural product drug discovery. Nature Reviews Drug Discovery 22, 895-916 (2023). https://doi.org/10.1038/s41573-023-00774-7
8. Tropsha, A., Isayev, O., Varnek, A., Schneider, G. & Cherkasov, A. Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. Nature Reviews Drug Discovery 23, 141-155 (2024). https://doi.org/10.1038/s41573-023-00832-0
9. Song, C., Liu, F., Huang, Y., Wang, L. & Tan, T. in Proceedings, Part I, of the 18th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - Volume 8258 117-124 (Springer-Verlag, Havana, Cuba, 2013).
10. Walters, S. J. & Campbell, M. J. The use of bootstrap methods for analysing Health-Related Quality of Life outcomes (particularly the SF-36). Health Qual Life Outcomes 2, 70 (2004). https://doi.org/10.1186/1477-7525-2-70
11. Eckhardt, C. M. et al. Unsupervised machine learning methods and emerging applications in healthcare. Knee Surg Sports Traumatol Arthrosc 31, 376-381 (2023). https://doi.org/10.1007/s00167-022-07233-7
12. Altman, N. & Krzywinski, M. Clustering. Nature Methods 14, 545-546 (2017). https://doi.org/10.1038/nmeth.4299
13. Kamal, I. M. & Bae, H. Super-encoder with cooperative autoencoder networks. Pattern Recognition 126, 108562 (2022). https://doi.org/10.1016/j.patcog.2022.108562
14. Hinton, G. E. & Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. Science 313, 504-507 (2006). https://doi.org/10.1126/science.1127647
15. Steinbach, M., Ertöz, L. & Kumar, V. in New Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition (ed Luc T. Wille) 273-309 (Springer Berlin Heidelberg, 2004).
16. Yu, W., Wang, R., Nie, F. & Wang, F. Multi-view embedded clustering with unsupervised trace ratio LDA. Neurocomputing 315, 169-176 (2018). https://doi.org/10.1016/j.neucom.2018.07.014
17. Stumpfe, D. & Bajorath, J. Exploring Activity Cliffs in Medicinal Chemistry. Journal of Medicinal Chemistry 55, 2932-2942 (2012). https://doi.org/10.1021/jm201706b
18. Han, Z. et al. Mesh Convolutional Restricted Boltzmann Machines for Unsupervised Learning of Features With Structure Preservation on 3-D Meshes. IEEE Trans Neural Netw Learn Syst 28, 2268-2281 (2017). https://doi.org/10.1109/tnnls.2016.2582532
19. Guo, X., Liu, X., Zhu, E. & Yin, J. 373-382 (Springer International Publishing).
20. Zdrazil, B. et al. The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. Nucleic Acids Research 52, D1180-D1192 (2023). https://doi.org/10.1093/nar/gkad1004
21. Shen, W. X. et al. Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. Nature Machine Intelligence 3, 334-343 (2021). https://doi.org/10.1038/s42256-021-00301-6
22. Ilnicka, A. & Schneider, G. Compression of molecular fingerprints with autoencoder networks. Molecular Informatics 42, 2300059 (2023). https://doi.org/10.1002/minf.202300059
23. Moon, K. R. et al. Visualizing structure and transitions in high-dimensional biological data. Nature Biotechnology 37, 1482-1492 (2019). https://doi.org/10.1038/s41587-019-0336-3
24. McInnes, L. & Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv abs/1802.03426 (2018).
25. Li, Y. H. et al. Clinical trials, progression-speed differentiating features and swiftness rule of the innovative targets of first-in-class drugs. Briefings in Bioinformatics 21, 649-662 (2019). https://doi.org/10.1093/bib/bby130
26. Vasaikar, S., Bhatia, P., Bhatia, P. & Yaiw, K.-C. Complementary Approaches to Existing Target Based Drug Discovery for Identifying Novel Drug Targets. Biomedicines 4, 27 (2016). https://doi.org/10.3390/biomedicines4040027
27. Attwood, M. M., Fabbro, D., Sokolov, A. V., Knapp, S. & Schiöth, H. B. Trends in kinase drug discovery: targets, indications and inhibitor design. Nature Reviews Drug Discovery 20, 839-861 (2021). https://doi.org/10.1038/s41573-021-00252-y
28. Kumar, V. et al. Role of Tyrosine Kinases and their Inhibitors in Cancer Therapy: A Comprehensive Review. Curr Med Chem 30, 1464-1481 (2023). https://doi.org/10.2174/0929867329666220727122952
29. Shaban, N., Kamashev, D., Emelianova, A. & Buzdin, A. Targeted Inhibitors of EGFR: Structure, Biology, Biomarkers, and Clinical Applications. Cells 13 (2023). https://doi.org/10.3390/cells13010047
30. Santiago, A. D. S. et al. Structural Analysis of Inhibitor Binding to CAMKK1 Identifies Features Necessary for Design of Specific Inhibitors. Sci Rep 8, 14800 (2018). https://doi.org/10.1038/s41598-018-33043-4
31. Profeta, G. S. et al. Binding and structural analyses of potent inhibitors of the human Ca(2+)/calmodulin dependent protein kinase kinase 2 (CAMKK2) identified from a collection of commercially-available kinase inhibitors. Sci Rep 9, 16452 (2019). https://doi.org/10.1038/s41598-019-52795-1
32. Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nature Chemistry 4, 90-98 (2012). https://doi.org/10.1038/nchem.1243
33. Bondensgaard, K. et al. Recognition of Privileged Structures by G-Protein Coupled Receptors. Journal of Medicinal Chemistry 47, 888-899 (2004). https://doi.org/10.1021/jm0309452
34. Liao, J. J.-L. Molecular Recognition of Protein Kinase Binding Pockets for Design of Potent and Selective Kinase Inhibitors. Journal of Medicinal Chemistry 50, 409-424 (2007). https://doi.org/10.1021/jm0608107
35. van Linden, O. P. J., Kooistra, A. J., Leurs, R., de Esch, I. J. P. & de Graaf, C. KLIFS: A Knowledge-Based Structural Database To Navigate Kinase–Ligand Interaction Space. Journal of Medicinal Chemistry 57, 249-277 (2014). https://doi.org/10.1021/jm400378w
36. Aronov, A. M., McClain, B., Moody, C. S. & Murcko, M. A. Kinase-likeness and Kinase-Privileged Fragments: Toward Virtual Polypharmacology. Journal of Medicinal Chemistry 51, 1214-1222 (2008). https://doi.org/10.1021/jm701021b
37. Schneider, P. & Schneider, G. Privileged Structures Revisited. Angewandte Chemie International Edition 56, 7971-7974 (2017). https://doi.org/10.1002/anie.201702816
38. Ilnicka, A. & Schneider, G. Designing molecules with autoencoder networks. Nature Computational Science 3, 922-933 (2023). https://doi.org/10.1038/s43588-023-00548-6
39. Gebru, T. et al. Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. Proceedings of the National Academy of Sciences 114, 13108-13113 (2017). https://doi.org/10.1073/pnas.1700035114
40. Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354-359 (2017). https://doi.org/10.1038/nature24270
41. Lu, S. & Li, R. DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning. ArXiv abs/2102.07472 (2021).
42. Ren, Y. et al. Deep Clustering: A Comprehensive Survey. IEEE Transactions on Neural Networks and Learning Systems, 1-21 (2024). https://doi.org/10.1109/TNNLS.2024.3403155
43. Xie, J., Girshick, R. & Farhadi, A. in Proceedings of The 33rd International Conference on Machine Learning Vol. 48 (eds Balcan Maria Florina & Q. Weinberger Kilian) 478-487 (PMLR, Proceedings of Machine Learning Research, 2016).
44. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res 47, D930-D940 (2019). https://doi.org/10.1093/nar/gky1075
45. Gaulton, A. et al. The ChEMBL database in 2017. Nucleic Acids Res 45, D945-D954 (2017). https://doi.org/10.1093/nar/gkw1074
46. Yang, K. et al. Analyzing Learned Molecular Representations for Property Prediction. Journal of Chemical Information and Modeling 59, 3370-3388 (2019). https://doi.org/10.1021/acs.jcim.9b00237
47. Shen, W. X. et al. AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks. Nucleic Acids Research 50, e45 (2022). https://doi.org/10.1093/nar/gkac010
48. Dong, J. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 10, 16 (2018). https://doi.org/10.1186/s13321-018-0270-2
49. Proceedings of the 24th International Conference on Neural Information Processing Systems. (Curran Associates Inc., 2011).
50. Kanev, G. K., de Graaf, C., Westerman, B. A., de Esch, I. J. P. & Kooistra, A. J. KLIFS: an overhaul after the first 5 years of supporting kinase research. Nucleic Acids Research 49, D562-D569 (2020). https://doi.org/10.1093/nar/gkaa895

Additional Declarations
There is NO Competing Interest.

Supplementary Files
supptable16.docx: Supplementary Table 1-6
supptable716.xlsx: Supplementary Table 7-16
suppfigure121.docx: Supplementary Figure 1-21
supplementFig22.pdf: Supplementary Figure 22
pharmacophore.mp4: Software Video
We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6755378","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":470407187,"identity":"96c57abc-bf6b-4f40-8729-5d18736de4fa","order_by":0,"name":"YING TAN","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAABIklEQVRIiWNgGAWjYNACAwYGfghLAoyAgBmnYh6YFskGIH2AeC0gXQfAWhgIa7FnP3uAmafgjt3m88cfMH/4Y5EnH91j+IGhwjqxASiF1RaevARmHoNnydtuJCQwHGyTKDa8c8ZYguFMemIDUAq7w3IMgFoOJ5vdADrrYINE4sYZOQYSjG2HExskeAywauF/A9Fi3H+wgeHAH7AW4x+M//BokYDYYmfAkAz0PptE4nyJHDMJxgY8Wm68MTg4x+BwgsSNNIYDZ9skEjdIpJVZJBxLN27jycGqhb0/x/DBmz+H7fn7jz98UPGnLnH+jOTNNz7UWMv2s5/BqgUEDgEjJ7GBARot4AgCBRUbLvVAwPgDGD9wnnwDHqWjYBSMglEwIgEAS/NgjLqgktcAAAAASUVORK5CYII=","orcid":"","institution":"Tsinghua University","correspondingAuthor":true,"prefix":"","firstName":"YING","middleName":"","lastName":"TAN","suffix":""},{"id":470407188,"identity":"a8f1e019-7db5-4a6b-af7e-431c9d5919d8","order_by":1,"name":"Huazhang Ying","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Huazhang","middleName":"","lastName":"Ying","suffix":""},{"id":470407189,"identity":"bc07e860-c92b-4dc3-b394-39c7dec05db1","order_by":2,"name":"Xiang 
Wu","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Xiang","middleName":"","lastName":"Wu","suffix":""},{"id":470407190,"identity":"8ae41d4c-1e07-4818-8ea7-f87ef7c44e67","order_by":3,"name":"Chu Qin","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Chu","middleName":"","lastName":"Qin","suffix":""},{"id":470407191,"identity":"89a4551c-c596-43d5-af38-2291cdd28323","order_by":4,"name":"Likun Zhang","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Likun","middleName":"","lastName":"Zhang","suffix":""},{"id":470407192,"identity":"9e707683-7951-4abe-86da-65046e31862b","order_by":5,"name":"Zhicheng Du","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Zhicheng","middleName":"","lastName":"Du","suffix":""},{"id":470407193,"identity":"1cd7ab7b-b397-4683-9908-8409f6fa4a05","order_by":6,"name":"Jiaqi Liu","email":"","orcid":"","institution":"","correspondingAuthor":false,"prefix":"","firstName":"Jiaqi","middleName":"","lastName":"Liu","suffix":""},{"id":470407194,"identity":"7f6f00a9-0f08-4ad3-aa77-a5f027707a2d","order_by":7,"name":"Yu Zong Chen","email":"","orcid":"https://orcid.org/0000-0002-5473-8022","institution":"Tsinghua Shenzhen International Graduate School, Tsinghua University","correspondingAuthor":false,"prefix":"","firstName":"Yu","middleName":"Zong","lastName":"Chen","suffix":""}],"badges":[],"createdAt":"2025-05-27 04:55:24","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6755378/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6755378/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":84703836,"identity":"6f46ed9a-7444-49a0-a8ee-1cfb90644aa4","added_by":"auto","created_at":"2025-06-16 12:01:57","extension":"png","order_by":1,"title":"Figure 
1","display":"","copyAsset":false,"role":"figure","size":1723600,"visible":true,"origin":"","legend":"\u003cp\u003eWorkflow of MolF-DAEs for unsupervised clustering in construction, visualization, and analysis of biomolecules. Symmetric DNNs are used to build MolF-DAEs. 1.9 million biomolecule data is used as an example, but the strategy is applicable to 3D non-destructive visualisation of other high dimensional data. Five categories of molecules are labeled according to target types (kinase-inhibitor, GPCR-binding, nuclear receptor-binding, protease-inhibitor, and remainders). SMILES are transformed into three types of Fingerprint FMaps. The molecular coordinates in 3DLSpace are visualized using the 3D visualization software, Chempack. The 3D clustering is utilized for downstream applications including motif analysis and bioactivity-related cluster analysis.\u003c/p\u003e","description":"","filename":"figure1XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/5bc65aed3070e7bd75d7b382.png"},{"id":84704943,"identity":"b52c81ea-8417-426c-875f-c3010f8d7fea","added_by":"auto","created_at":"2025-06-16 12:09:57","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":2047579,"visible":true,"origin":"","legend":"\u003cp\u003eFully-connected deep autoencoder shows high accuracy in reconstructing Fmaps. (a) Training process of the MSE models. The trend of the loss function in first stage (binary cross entropy) (left) and second stage (MSE) (right) process. (b) Parameter usage and corresponding optimal MSE within different molecular fingerprint channels. (c) Reconstruction effect of Fmaps. Three pairs of Fmaps are shown in different colors. The left represents the original molecular fingerprint images, while the right represents the reconstructed 2D fingerprint Fmaps. 
White dots denote values of 1, while colored dots (blue: PubChemFP, green: MACCSFP, yellow: PharmacoPFP) represent values of 0, with darker colors indicating closer proximity to 1.\u003c/p\u003e","description":"","filename":"figure2XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/441be0677d4dc76ee64cdfc1.png"},{"id":84703841,"identity":"366fab6f-fcfd-4419-8dda-3cf75289b990","added_by":"auto","created_at":"2025-06-16 12:01:57","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":5744132,"visible":true,"origin":"","legend":"\u003cp\u003eThe visualization of biomolecules within 3DLSpace, facilitated by the intuitive interface of the Chempack software. (a-c) Distribution of 1.9 million molecules in (a) PubChemFPM (b) MACCSFPM and (c) PharmacoPFPM. (d-g) Distribution of biomolecules in 3DLSpace and target labeling cluster to four target types (Red: kinase, green: protease, Cyan: GPCR). Molecules in representative clusters are colored to the target component.\u003c/p\u003e","description":"","filename":"figure3XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/d17bce87986e3d1560205fc4.png"},{"id":84703839,"identity":"058ec957-c0be-4cca-b3e9-f8e4c3ab7bfa","added_by":"auto","created_at":"2025-06-16 12:01:57","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":2692514,"visible":true,"origin":"","legend":"\u003cp\u003eThe overall molecule feature distributions and differences in three models in MolF-DAEs. (a) The properties of biomolecules in six representative bands without the gray points. (b) The proportions of all 20 kinase groups within the four kinase-enriched bands 1, 2, 5, and 6. (c) The proportions of the top 10 kinase \u0026nbsp;families within the four kinase-enriched bands 1, 2, 5, and 6. 
(d) Kinase and GPCR islands in kinase-enriched band 1 and randomly chosen molecule structures and their core elements.\u003c/p\u003e","description":"","filename":"figure4XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/0686175f3c620e9a4b8cae4b.png"},{"id":84704945,"identity":"9e1fec5d-edae-49ad-977f-1f76f2e82265","added_by":"auto","created_at":"2025-06-16 12:09:58","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":1977596,"visible":true,"origin":"","legend":"\u003cp\u003eThe regularity of substructures explains the patterns of band clustering. (a) Two Models and their substructure composition elements. (b) Substructure enrichment in the 4 kinase-enriched bands compared to the background. (c) Samples and substructure in kinase enriched bands in three models. (Top: PubChemFPM, middle: MACCSFPM, down: PharmacoPFPM) (d) Molecules colored with amide bond and its variants in protease bands. (e) GPCR enriched islands colored by atomic type.\u003c/p\u003e","description":"","filename":"figure5XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/dc9a98741577df2b81657367.png"},{"id":84703844,"identity":"da732699-e475-44e8-bf46-5383a1cb3c1e","added_by":"auto","created_at":"2025-06-16 12:01:58","extension":"png","order_by":6,"title":"Figure 6","display":"","copyAsset":false,"role":"figure","size":1624588,"visible":true,"origin":"","legend":"\u003cp\u003eBinding sites of molecules in kinase-enriched bands in PubChemFPM. (a-c) Kinase inhibitors and binding pockets in island 1 in kinase-enriched band 1. Combination pattern and pocket annotation is obtained from KLIFS\u003csup\u003e50\u003c/sup\u003e. 
(d) Kinase inhibitor and binding pocket in kinase-enriched band 2.\u0026nbsp;\u003c/p\u003e","description":"","filename":"figure6XXXXXX1.png","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/a9f67f5933eebf928b765438.png"},{"id":84706581,"identity":"b8437d44-4b27-41f9-98b3-19bcc28163bf","added_by":"auto","created_at":"2025-06-16 12:34:07","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":15759976,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/99faa2ea-18fb-4fd3-8ce3-9ceb388808aa.pdf"},{"id":84705401,"identity":"ab914e83-7237-453d-8691-a293719c6111","added_by":"auto","created_at":"2025-06-16 12:17:57","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":489999,"visible":true,"origin":"","legend":"Supplementary Table 1-6","description":"","filename":"supptable16.docx","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/3cc709a292f384e36f2b1f8f.docx"},{"id":84704946,"identity":"550f3577-c13c-4535-919b-16c758c964dc","added_by":"auto","created_at":"2025-06-16 12:09:58","extension":"xlsx","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":15933742,"visible":true,"origin":"","legend":"Supplementary Table 7-16","description":"","filename":"supptable716.xlsx","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/11858e5498dadaf7be121c3d.xlsx"},{"id":84704958,"identity":"b94a9a05-6ff0-49bd-ae18-62190df0e9f3","added_by":"auto","created_at":"2025-06-16 12:09:59","extension":"docx","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":21912772,"visible":true,"origin":"","legend":"Supplementary Figure 
1-21","description":"","filename":"suppfigure121.docx","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/2a3c721c3ca69eb55f7f6495.docx"},{"id":84703843,"identity":"b9214fc6-2dca-433c-8f9a-44d0e13973f2","added_by":"auto","created_at":"2025-06-16 12:01:58","extension":"pdf","order_by":4,"title":"","display":"","copyAsset":false,"role":"supplement","size":1356514,"visible":true,"origin":"","legend":"Supplementary Figure 22","description":"","filename":"supplementFig22.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/1d5560d0e40a68c76fc81402.pdf"},{"id":84703884,"identity":"f7a40f5e-03d9-48fa-894e-2e5a9b9bc234","added_by":"auto","created_at":"2025-06-16 12:01:59","extension":"mp4","order_by":5,"title":"","display":"","copyAsset":false,"role":"supplement","size":35165820,"visible":true,"origin":"","legend":"Software Video","description":"","filename":"pharmacophore.mp4","url":"https://assets-eu.researchsquare.com/files/rs-6755378/v1/fe820f5ec932e6e4a630a11b.mp4"}],"financialInterests":"There is \u003cb\u003eNO\u003c/b\u003e Competing Interest.","formattedTitle":"Molecular-substructure Deep Autoencoders Cluster Biomolecules into Novel Band-Shaped Substructure-Distinguished Bioactivity Clusters in 3D Latent Space","fulltext":[{"header":"Introduction","content":"\u003cp\u003eUnsupervised learning such as deep clustering methods has been widely applied in real-life statistical analysis such as pattern recognition\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e\u003c/sup\u003e, image processing\u003csup\u003e\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e, and knowledge discovery tasks such as bioactive molecule (biomolecules) clustering\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e, genomics data 
mining\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e, and disease diagnosis\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e. One application of deep clustering is in drug discovery, where effective clustering of biomolecules with respect to common molecular determinants facilitates the mapping of pharmacological chemical space\u003csup\u003e\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u003c/sup\u003e and the investigation of structure-activity relationships\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eVarious clustering methods have been developed based on the fundamental data features or their linearly/non-linearly transformed variants\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e, such as K-means\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e, hierarchical clustering\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e and spectral clustering. Moreover, deep autoencoders (DAEs) are highly useful for deep and complex clustering tasks\u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e. Due to the high dimensionality, sparsity and variance of data features, these clustering methods rely on feature representation and dimensionality reduction techniques\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e. 
Principal Component Analysis (PCA), Uniform Manifold Approximation and Projection (UMAP), and t-Distributed Stochastic Neighbor Embedding (t-SNE)\u003csup\u003e\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e are the most common algorithms used as a preprocessing step to provide useful cluster patterns. Under these methods, the visualization of data clusters and the subsequent analysis may be affected by visual distortion in the low-dimensional space. Minor visual distortions may in some cases affect the quality of cluster analysis. For example, minor differences in biomolecular structures (i.e. minor differences in the separation of the cluster neighbors) may lead to substantial changes in bioactivity targets and bioactivity values \u003csup\u003e\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e. There is a need for effective methods to both cluster data and visualize the undistorted cluster patterns.\u003c/p\u003e \u003cp\u003eDAEs \u003csup\u003e\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e may be potentially explored for data clustering and undistorted visualization. It captures nonlinear relationships of complex patterns while preserving both local and global characteristics\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e,\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e. In order for undistorted visualization of the DAE-derived data clusters, one may consider the construction of DAEs with 3DLSpace, where the data clusters can be straightforwardly visualized in the 3DLSpace without data distortion. A question is whether DAEs can meaningfully cluster data in 3DLSpace. 
Here we developed molecular fingerprint deep autoencoders (MolF-DAEs) to demonstrate the clustering ability of DAEs in 3DLSpace. We further revealed the DAE-derived cluster patterns of bioactive molecules and discussed their potential implications to drug discovery tasks.\u003c/p\u003e \u003cp\u003eMolF-DAEs consist of symmetric fully connected encoders and decoders, which were trained by 1.9\u0026nbsp;million biomolecules from the ChEMBL database\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e represented by three sets of molecular fingerprints (MFs). The three sets of DAEs are the PubChem molecular fingerprint model (PubChemFPM), MACCS keys fingerprint model (MACCSFPM), and 2D pharmacophore fingerprints model (PharmacoPFPM)\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Compared with existing methods, MolF-DAEs do not require additional dimensionality reduction methods. Additionally, we developed a chemical space navigation simulation software Chempack for displaying and analyzing the band cluster landscapes. Rather than solving a specific downstream task prediction, MolF-DAEs aim at mining reliable experimentally obtained high activity data to evaluate activity-related compound spatial and organism target spatial. This method can be migrated to other types of sparse high-dimensional data mining.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e1. High Accuracy of Deep Autoencoders in Reconstructing Three Fingerprint Feature Maps\u003c/p\u003e\n\u003cp\u003eThis work demonstrates the high efficacy of deep autoencoders in reconstructing biomolecule fingerprint feature maps (Fmaps). In order to increase the efficiency, we focus on the molecules presenting preliminary biological activity. 
In total, a dataset of 1.9 million molecules with high bioactivity (potency values \u0026le; 10 \u0026mu;M) was collected from the ChEMBL database. Three deep autoencoders are constructed to encode three sets of molecular fingerprint features: PubChemFP and MACCSFP, based on SMILES arbitrary target specification (SMARTS), and PharmacoPFP, based on pharmacophores. The reconstruction rates for PubChemFPM, MACCSFPM, and PharmacoPFPM reach 97.55%, 96.10% and 97.55% respectively, surpassing the reported reconstruction rates of 95.3%-96.4% from a latent space dimension of 196 in a variational autoencoder (VAE) trained on 250,000 drug-like molecules\u003csup\u003e4\u003c/sup\u003e. This indicates the precision of the models in constructing two-dimensional Fmaps, thereby establishing their reliability in interpreting the physicochemical properties, structural fragments, and spatial distributions of molecules. Taking the PubChemFP model as an example, the loss values stabilize around 0.0245 after 100 training epochs (Fig.2a). Although parameter count correlates with model performance, the minimum MSE does not necessarily require the maximum number of parameters (Fig.2b). Compared with SMILES, the Fmaps distinguish molecules more clearly. As the MSE decreases, the visual improvement in the reconstruction of Fmaps becomes apparent (Fig.2c, Supplementary Fig.1-6). Although low-dimensional vectors make accurate fingerprint reconstruction difficult\u003csup\u003e22\u003c/sup\u003e, the majority of the original data characteristics are preserved after reconstruction, barring slight deviations in some local features. This is attributed to the extensive training data and the feature representation methodology.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e2. 
Three sets of molecular fingerprints FP-DAEs exhibit band-shaped clustering patterns in 3DLSpace\u003c/p\u003e\n\u003cp\u003eIt is intriguing to see the distribution landscape in latent space based on substructures and pharmacophores. Overall, these distributions are orderly and highly consistent with human knowledge. MolF-DAEs achieve optimal distribution in 3DLSpace as training progresses (Fig.3a-c). The distribution is spatially discontinuous, with distinct boundaries. Internally, molecules are arranged into bands, each of which originates from a common central region. This arrangement has been observed with nonlinear algorithms such as UMAP, representing mappings dominated by the strong effects of major gradient features\u003csup\u003e23,24\u003c/sup\u003e. Here it is observed for the first time in DAEs.\u003c/p\u003e\n\u003cp\u003eThe latent space exhibits distinct, island-like regions: dense clusters where specific molecular substructures or pharmacophore labels are uniquely enriched. These regions emerge as training progresses, separating from the general molecular distribution to form cohesive zones, each representing variations of a common molecular feature. Target type is a molecular feature that has received widespread attention\u003csup\u003e25\u003c/sup\u003e. It is reported that over 50% of drug design targets are concentrated in four categories: kinase, protease, nuclear receptor and G-protein coupled receptor (GPCR)\u003csup\u003e26\u003c/sup\u003e. However, these target families cover only 1.45%-6.42% of experimentally verified biomolecules in the 1.9 million compound database. This indicates that the space of biomolecules remains vast. In MolF-DAEs, without any prior target information, these targets naturally appear in relatively isolated regions (Fig.3d-g). 
For example, in some of these regions, the concentrations of specific targets reached the following values: kinase (71%), protease (71%), and GPCR (54%). Therefore, MolF-DAEs demonstrate exceptional effectiveness in identifying target-specific clusters. Within the PubChemFPM, kinase inhibitors are concentrated in the upper region in long islands, while protease inhibitors are concentrated in the lower region. GPCR has fewer known drugs and more diverse natural ligands\u003csup\u003e27\u003c/sup\u003e. The clustering of GPCR appears dispersed and concentrated in short islands (Fig.3d, Supplementary Fig.7).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe three most common literature-reported molecular fingerprint features are used (Supplementary Figs. 7-9). Each of them exhibits unique clustering patterns. PubChemFPM captures the substructural features of molecules, with target-specific bands with the\u0026nbsp;clearest separation. In contrast, MACCSFPM mainly focuses on the overall structure of molecules, closely connected among clusters. PharmacoPFPM mainly describes the position, spatial relationship, and interaction pattern of pharmacophores in molecules. It exhibits many distinctly isolated short islands in addition to the band structures. The performance of downstream tasks is related to factors like fingerprint types, dimensions, compression, and redundancy\u003csup\u003e22\u003c/sup\u003e.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e3. Bands exhibit significant differences in the potential biotherapeutic targets within the 3DLSpace\u003c/p\u003e\n\u003cp\u003eThe in-cluster biomolecules are of varying molecular structures but frequently form a limited number of bioactivity classes. Significant concentrations are captured in the scale of substructures, physical properties, and targeting families. 
For subsequent analyses and applications, we manually select six representative bands with clear boundaries spanning the compound space, across all fingerprint channels (bands 1-4: PubChemFPM, band 5: MACCSFPM, band 6: PharmacoPFPM) and privileged target islands (bands 1-2 and 5-6: kinase, band 3: protease, band 4: GPCR). Although there is a fair number of confusing ligands (gray), the four major target classes account for up to 34% of points (Fig.4a and Supplementary Data 8). Kinase is a class of targets with relatively conserved binding pockets, causing widespread off-target effects. Kinase inhibitors account for 69.22%-84.01% of molecules in bands 1-2 and 5-6 (Fig.4a), and significant changes appear among families (Fig.4b) and groups (Fig.4c). In the four kinase-enriched bands, members of the Janus kinase (JakA) family and the Receptor tyrosine kinase (RTK) group, such as the Epidermal growth factor receptor (EGFR) and fibroblast growth factor receptor (FGFR) families, exhibit the highest proportions. The Tec family is uniquely enriched in PubChemFPM band 1. \u003cu\u003eEGFR i\u003c/u\u003e\u003cu\u003enhibitors such as Erlotinib and Gefitinib are developed as anticancer therapeutics, mainly consisting of rings as a basic skeleton\u003c/u\u003e\u003cu\u003e\u003csup\u003e28,29\u003c/sup\u003e\u003c/u\u003e\u003cu\u003e. They are enriched within the kinase cluster and positioned closely to each other, with a distance of 25.42\u003c/u\u003e\u003cu\u003e.\u003c/u\u003e The top 10 families enriched in PharmacoPFPM are more concentrated and closer in position on inter-group and evolutionary trees (Supplementary Fig.11). Evolutionarily closely related kinase families tend to have similar core structures, which is likely to cause cross-effects. 
Conversely, the PubChemFP band covers a broader range of families, including a new enrichment inhibitor group with fewer known inhibitor studies such as Tyrosine kinase-like (TKL), Calcium/calmodulin-dependent protein kinase (CAMK) and the remainder\u003csup\u003e27\u003c/sup\u003e. The degree of enrichment varies significantly\u0026nbsp;in\u0026nbsp;remainders. PharmacoPFP band 5 enriches 30.61% of\u0026nbsp;the\u0026nbsp;Calcium/calmodulin-dependent protein kinase group\u0026nbsp;(CAMK) with fewer reported\u0026nbsp;researches\u003csup\u003e30,31\u003c/sup\u003e. It indicates that relevant structures with similar activities are sensitively captured by the model in undistorted clusters. This contributes to novel target inhibitor pattern researches.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e4. Core substructures combination explains undistorted band-shape cluster organization\u003c/p\u003e\n\u003cp\u003eThere is core structural unity and local residue diversity in \u003cimg width=\"25\" height=\"19\" src=\"data:image/png;base64,R0lGODlhGQATAHcAMSH+GlNvZnR3YXJlOiBNaWNyb3NvZnQgT2ZmaWNlACH5BAEAAAAALAEAAgAXAA0AhQAAAAAAAAAAOgA6OgA6ZgA6kABmtjoAADo6kDpmkDqQ22YAAGY6OmaQtmaQ22a222a2/5A6AJA6ZpBmOpC2/5Db/7ZmALb//9uQOtuQZtu2Ztv/ttv///+2Zv/bkP/btv//tv//2wECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwECAwZ1QIBwSCwONYdAgmNsEj0IzmcBcVqHIclFGJpUiciA4FHMfMNfoedAxggqwwxFmGlsqERLgQlaGOhVHV99aSERCkN6HGsBjYN4Qn2IQopOhEOSiXuWkAB9f5SbTZeaTIaTo50AHQFkGm9NHgyNAQMOcUkEcE1BADs=\" alt=\"image\"\u003e\u0026nbsp;sample size, suggesting the high-quality undistorted distribution pattern. Privileged target types exist, while the principle behind target label clustering is limited to FMap-related substructural classes. GPCR binding fragments share a conserved structural scaffold, whereas kinase inhibitors exhibit more variations in the scaffold and substituents. 
However, distinct GPCR islands with clear boundaries and structural similarities were identified in kinase-enriched band 1 (Fig.4d). 95.21% of molecules in band 1 share hydrogen bond sites N-C-N-C including the GPCR islands, while other GPCR-enriched bands seldom take this substructure (3.23%). In another kinase-enriched band, the N-C-N-C-C-N structure is over twice as prevalent compared to band 1. Molecules in the kinase island typically exhibit higher LogP, while both molecular weight and LogP of biomolecules in the GPCR islands are lower. Thus, MolF-DAEs separate bands based on various substructural or pharmacophore features. This further explains the intrinsic reasons for cluster-target relation. Furthermore, this data-driven clustering contributes to drug novelty evaluation. Molecules contain amide bonds (-CONH-) with hydrogen bonding, and methoxy-substituted phenyl rings are enriched in kinase island 1. The structure is common in various drugs, including anticancer, antibacterial, and anti-inflammatory drugs like Amitriptyline. Methoxy-substituted phenyl rings are relatively easy to introduce in organic synthesis, and methoxy groups increase lipophilicity, affecting drug metabolism and absorption. In kinase island 2, more substituents like methoxy (-OCH3), fluorine (-F), and chlorine (-Cl) are present. Notably, the gray biomolecules are highly likely to be new kinase biomolecules. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eTo explain the observed undistorted clustering, we summarize the common functional structural modes. It provides a basis for binding studies. Structural analysis reveals a widespread substructural pattern. Each band cluster primarily consists of molecules with unique scaffold or substructure combinations, with minimal overlap with other clusters. The kinase-specific substructure mode involves a combination of two elements: core hydrophobic ring elements (blue) and core hydrogen site elements (red) (Fig.5a). 
These substructures account for less than 15% of the compounds in other target bands (Fig.5c, Supplementary Fig.12b). The linear structures serve as hydrogen donors and acceptors. They are decisive for the overall binding strength and positioning. Two kinds of linear structure are observed: the linear framework and its variants, and the Y-shaped framework and its variants. Elements such as Y-shaped N-O-N, N-S-N, N-O-CN, or L-shaped NC-N-CN, along with core hydrophobic ring elements, are observed in Lapatinib, Imatinib and Sorafenib. Hydrophobic ring elements contain 6-membered or 5-membered carbon rings, which are important for hydrophobicity and aid in overall stabilization. The ring elements are connected by one or more carbon chains to the core elements. The question is which molecular features are crucial for selective kinase targeting and potency. To compare against the overall background, we counted the frequency of occurrence of these substructures in the 1.9 million molecules and in each band, respectively. In kinase-specific inhibitor bands, the core element combination is enriched up to 22.01-fold over background (Fig.5b, Supplementary Fig. 9). This indicates that MolF-DAEs strongly select this bioactive substructure as an important feature. Overall, in PubChemFPM regions, the linear carbon-nitrogen substructure (NC-N-C) appears in almost all known kinase inhibitors (band 1: 100%, band 2: 99.8%) and in other molecules (band 1: 95.2%, band 2: 95.1%). Most of the kinase inhibitors are connected to at least one six-membered ring (band 1: 62.2%, band 2: 40.4%) or five-membered ring (band 1: 43.4%, band 2: 37.1%). MACCSFPM covers fewer chemical substructures than PubChemFPM, indicating a more concentrated distribution. There is no Y-shaped substructure connected to six-membered rings in band 5. 
The pharmacophore fingerprint primarily describes the position, spatial relationships, and interaction patterns of pharmacophores in molecules. The clustering of PharmacoPFPM in 3DLSpace is more pronounced, with relatively high proportions in various feature substructures.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eProtease is another target type that is clearly separated within its band. Peptide chains are prevalent, exhibiting variations in C-chain length and element substitutions. In fact, protease inhibitors compete with natural peptide substrates while not being degraded by proteases. Some unidentified ligands with long peptide chains are captured in the band, such as CHEMBL100202 and CHEMBL102898. In terms of property labeling, this is at variance with the reported distribution of overall physicochemical properties of approved drugs\u003csup\u003e32\u003c/sup\u003e. Similar to peptide drugs, protease-privileged bands generally have high molecular weights, lower logarithmic partition coefficient (LogP) values (indicating greater hydrophilicity), and higher topological polar surface area (TPSA). For example, 27% of the molecules exhibit a LogP below 0, a significantly higher proportion than for approved drugs, whose LogP values concentrate around 1-3. The amide bonding increases the polarity of the molecule and the number of hydrogen bonding donors/acceptors. These differences in structure and physicochemical properties tend to yield lower quantitative estimate of drug-likeness (QED) values, averaging around 0.1, well below the reported average of 0.35. 
This underscores the unique structural and drug-forming properties of this cluster\u003csup\u003e32\u003c/sup\u003e.However, these features, while conducive to protease activity, pose additional challenges in pharmacokinetics for molecule in this band, similar to those faced by peptide drugs.\u003c/p\u003e\n\u003cp\u003eThere are GPCR concentrated blocks in band 1 and band 4 despite the fact that the number of compounds is less than half that of kinase and protease. In contrast to the kinase conserved pocket, the natural substrates of GPCR are more mixed, including both nucleic acid substrates and peptide substrates. In 3DLSpace, GPCR-specific clusters are mostly found in the form of short islands, with more dispersed molecular clustering. Some are distributed within other target clusters, implying that they have similar functional groups or competing substrates. However, a small number of independent long island GPCR bands still exist in band1. Structures within islands are often very similar in specific lengths of core hydrogen bonding elements (Y-shaped NCO, NCN, NCCO), combined with core hydrophobic ring elements (R and RN)\u003csup\u003e33\u003c/sup\u003e. Islands differ mainly in substituents, such as O/N-rich islands 1 and 2, and F/Cl-rich island 2. GPCRs have binding pockets that vary significantly in nature, with F providing strong liposolubility possibly related to binding to the water transport pocket in the transmembrane region, and O being closely associated with the extracellular region.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e5. Relevance and distinction of sub-structural features with respect to the literature-reported privileged pharmacophores and drug-binding mode\u003c/p\u003e\n\u003cp\u003eA question is raised about the relevance of the DAE-captured sub-structural features of an individual band cluster to the selected bioactivities of the band cluster. 
Through literature-reported kinase structures, the enriched categories are highly consistent with those reported\u003csup\u003e34\u003c/sup\u003e. Some substructural features of the band clusters comprise key frameworks in literature-reported kinase-binding modes of kinase inhibitor drugs, pharmacophores of kinase frequent hitters, and privileged fragments of kinase inhibitors. The L- and Y-shaped regions are usually binding to a stabilized presence of a hinge area. Core hydrogen bonding structures in different lengths allow for the choice of four sites that bind 2\u0026nbsp;to\u0026nbsp;3 sites to the hinge region or gatekeepers. Hydrophobic rings, which are larger and segmented into two parts, are more likely to extend into the E0 or back pocket to form stacking interactions. Different regions with unique substructure patterns are selective in their binding conformation. Structural differences in the remaining parts allow binding pockets of different conformations\u003csup\u003e35\u003c/sup\u003e. The reported kinase-binding conformations of band 2 with short/Y-shape hydrogen sites (PDB: 9JI, 3JW, GD9, P06, 6S1, and RXT) are in the front pocket, without the back pocket. Complexes with multiple long-chain hydrogen sites in band 1 (PDB: TZ0, R1L and UCW) occupy both the front and back pockets (Fig.6).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe sub-structural characterization provided by MolF-DAEs is closely consistent with the framework of selected bioactivities and the concentration of these biomolecules within the bands. This correlation is notably in line with the finding that kinase-specific fragments can enhance kinase inhibitors by 5-fold\u003csup\u003e36\u003c/sup\u003e, with sub-structures within individual bands exhibiting enrichments exceeding 25-fold. This concurrence also echoes reports indicating that certain specific molecular scaffolds can exhibit activity against multiple target classes. 
However, the substructural elements captured by MolF-DAEs diverge from the pharmacophoric and privileged fragments outlined in the literature in one aspect. While MolF-DAEs capture the fundamental substructural elements and their structural variations, which collectively define the framework of pharmacophoric or privileged fragments, kinase-specific structures documented in literature often manifest as specific combinations of fragments, such as bis-aryl-NH-linked fragments and biphenyl ether scaffolds\u003csup\u003e37\u003c/sup\u003e. Consequently, by assimilating the foundational elements of structural frameworks, MolF-DAEs possess the capacity to capture a broader spectrum of pharmacophoric and conventionally framed specific structures, thereby clustering them into individual band clusters.\u0026nbsp;\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eDAEs are a potential deep learning-based strategy for undistorted presentation and clustering. DAEs show predictive performance on high-dimensional datasets\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e and have successfully tackled challenging tasks on millions of training samples\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e,\u003cspan citationid=\"CR40\" class=\"CitationRef\"\u003e40\u003c/span\u003e\u003c/sup\u003e. In contrast to nonlinear methods with a fixed kernel function, autoencoders are learned by optimizing the reconstruction error\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e. Reconstruction quality reflects how well the latent space represents the data.\u003c/p\u003e \u003cp\u003eFor data with high complexity such as drugs, it seems difficult for a DAE to directly downscale to 3D space because information loss is inevitable. 
Thus, existing data such as MNIST images have only been downscaled from 28\u0026times;28\u0026times;1 to 128 dimensions. Effort has focused on joint methods built around DAEs to handle the space distortion challenge\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e,\u003cspan citationid=\"CR43\" class=\"CitationRef\"\u003e43\u003c/span\u003e\u003c/sup\u003e, with visualization provided by additional downscaling methods. The deep autoencoder (DAE) used here is an undistorted clustering method that handles high-dimensional datasets without an additional clustering strategy. In contrast, on the 1.9\u0026nbsp;million biomolecule dataset, reducing 128-dimensional DAE latent spaces to three dimensions with traditional methods (UMAP and PCA) performs poorly. Molecules exhibit a typical spherical distribution in 3D space and lose the target and substructure separation characteristics seen with MolF-DAEs (Supplementary Fig.\u0026nbsp;21).\u003c/p\u003e \u003cp\u003eBiological experiments on biomolecules typically involve dozens or hundreds of data points. Supervised learning research on biomolecules typically relies on tens of thousands of high-quality target or disease data points. Unsupervised learning breaks through these limitations of data quality. This work demonstrates significant improvements in both the size of the training dataset and the systematic utilization of physicochemical property feature dimensions. Surprisingly, all three sets of molecular fingerprints for various target types exhibit a characteristic radial distribution emanating from the origin. They exhibit target-specific clusters that in turn reflect a more fundamental classification based on molecular structure. This has potential implications for biomolecule classification and research. 
The information obtained from the model substructures is highly consistent with human knowledge, enabling the possibility of subdividing the biomolecules into refined subclasses. It offers crucial clues for understanding the relationship between the structure and activity of drugs. Meanwhile, MolF-DAEs offer the possibility of exploring the chemical space more impartially by effectively acquiring a meaningful latent space, free from human rationality or bias. Finally, this work provides interpretability to the clustering distribution of unsupervised models, while also aiding in tasks such as understanding target-related drug structures, identifying potential drug candidates, and facilitating drug repurposing efforts.\u003c/p\u003e \u003cp\u003eMolF-DAEs offer several possible directions of application. From a drug perspective, the conserved structures of a specific collection of molecules can be used for ligand-based virtual screening to discover new drugs for specific targets, to evaluate the novelty of bioactive skeletons and the diversity of substituents, and to guide the generation of new structures. From the target point of view, the biomolecule structure and drug diversity of the target can be evaluated, the potential off-targets of a compound list can be predicted, and the drug cross-reactivity of multiple proteins can be assessed.\u003c/p\u003e \u003cp\u003eIn the field of data analysis, compounds are a class of data that is rich in structural information and has high-dimensional feature representations. In the future, this approach can be migrated to other knowledge-related types of sparse high-dimensional data for non-destructive spatial clustering display and data mining.\u003c/p\u003e"},{"header":"Method","content":"\u003cp\u003eData collection and labeling. 
Biomolecules with medium to high activity (IC50, EC50 or Ki\u0026thinsp;\u0026le;\u0026thinsp;10 \u0026micro;M), measured by experimental methods such as the MTT assay and kinase activity tests, were selected as the dataset from the medicinal chemistry database ChEMBL\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e,\u003cspan citationid=\"CR45\" class=\"CitationRef\"\u003e45\u003c/span\u003e\u003c/sup\u003e. The dataset covers compounds from the preclinical to the approved stage. There are 1,943,048 biomolecules in total.\u003c/p\u003e \u003cp\u003eLabels denoting the verified target are used for visualization in 3DLSpace. Molecules with activity against four major drug target classes were queried: kinase, protease, GPCR and nuclear receptor. This yielded 124,632 biomolecules targeting kinase (red), 92,040 targeting protease (green), 33,718 targeting GPCR (cyan) and 28,216 targeting nuclear receptor (purple). Compounds with multiple target label values were excluded during 3D visualization. The remaining 1,658,503 biomolecules target other types of targets.\u003c/p\u003e \u003cp\u003eConstruction of molecular fingerprints Fmaps. We selected the three molecular fingerprints with the most citations among 1410 publications indexed in the PubMed database up to 2024: two substructure-key SMARTS-based features (PubChemFP, 192 bits and MACCSFP, 476 bits) and one pharmacophore-based feature (PharmacoPFP, 298 bits)\u003csup\u003e\u003cspan citationid=\"CR46\" class=\"CitationRef\"\u003e46\u003c/span\u003e\u003c/sup\u003e. These fingerprints are generated with MolMap, a method of molecular feature generation based on manifold learning\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e,\u003cspan citationid=\"CR47\" class=\"CitationRef\"\u003e47\u003c/span\u003e\u003c/sup\u003e. 
We use PyBioMed to remove the infrequently set bits of the one-hot-encoded PubChem molecular fingerprint, reducing the original 881 dimensions to 733 dimensions\u003csup\u003e\u003cspan citationid=\"CR48\" class=\"CitationRef\"\u003e48\u003c/span\u003e\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eOptimization of the parameters of MolF-DAEs. A pair of complementary DNNs is adopted, with an encoder as an extractor to convert Fmaps into 3DLSpace for clustering the molecules, and a decoder to convert the latent codes back to the original Fmaps for optimizing the autoencoder. The Adam optimizer is adopted, with binary cross entropy and then MSE used successively as the loss function, for an expected 100 training epochs. The MSE reconstruction loss is:\u003cdiv id=\"Equa\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equa\" name=\"EquationSource\"\u003e\n$$L_{rec}=\\min\\frac{1}{n}\\sum_{i=1}^{n}{\\left\\Vert x_{i}-\\varphi_{r}\\left(\\varphi_{e}\\left(x_{i}\\right)\\right)\\right\\Vert}^{2}$$\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eWhere \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\varphi_{e}(\\cdot)\\)\u003c/span\u003e\u003c/span\u003e and \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\varphi_{r}(\\cdot)\\)\u003c/span\u003e\u003c/span\u003e represent the encoder network and decoder network of MolF-DAE, respectively.\u003c/p\u003e \u003cp\u003eThe hyperparameters of the autoencoder were optimized by the tree-structured estimator approach in two phases\u003csup\u003e\u003cspan citationid=\"CR49\" class=\"CitationRef\"\u003e49\u003c/span\u003e\u003c/sup\u003e. In phase 1, the number of layers was varied from 2 to 100, and the number of nodes in the first hidden layer was set at 2048, followed by up to 50% reductions in the subsequent layers. 
Phase 1 proceeds until the reconstruction rate reaches\u0026thinsp;\u0026gt;\u0026thinsp;95% (percent of position-to-position reconstruction of the original molecular fingerprints), which is comparable to the reported 95.3%-96.4% reconstruction rate of an autoencoder trained on 250,000 drug-like molecules\u003csup\u003e\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e\u003c/sup\u003e. In phase 2, the number of nodes in the hidden layers was fine-tuned. Phase 2 proceeds until the reconstruction rate reaches its optimal value. The optimal parameters are detailed in Supplementary Table\u0026nbsp;10. In PubChemFPM, the encoder consists of a 5-layer fully connected neural network (DNN). In the encoder, each dense layer is followed by a ReLU activation function. The final layer connects to a fully connected layer with 3 nodes. Each dense layer in the decoder is followed by a ReLU activation function, except for the last dense layer, which employs a Sigmoid activation function. The total parameter count is 2,697,564, approximately 2.7\u0026nbsp;million.\u003c/p\u003e \u003cp\u003eIn terms of individual image size, MACCSFP is only a quarter the size of PubChemFP. Consequently, models with the original parameters have too many total parameters for the new dataset, leading to suboptimal Fmaps reconstruction. Models with fewer total parameters were adopted, and the parameter space was gradually expanded to find the optimal parameters. During optimization, a total parameter count of 2\u0026nbsp;million yielded better training results. The composition of the 2\u0026nbsp;million parameter network was continuously adjusted. The model is unstable in the task of reconstructing images from large datasets, although the variation range of the stabilized MSE is small and within an acceptable range. Therefore, the choice between these parameters has minimal impact on the final model selection. 
The influence of model network depth, encoder structure, and total parameter count on the reconstruction effect outweighs the impact of the model's inherent instability.\u003c/p\u003e \u003cp\u003eChempack Software Demonstration\u003c/p\u003e \u003cp\u003eAccessing spatial data linearly makes it difficult to answer spatial queries quickly during data retrieval and to conduct statistical analyses based on the macroscopic distribution of data in space. To address these issues, we have developed Chempack software for fast navigation and simultaneous display of the DAE-generated distribution landscapes of up to 50\u0026nbsp;million molecules in the 3-dimensional latent chemical space. The molecules within an on-screen subspace (a movable cubic box) are displayed as bright spheres (grey by default) embedded in the black background space. Subsets of molecules may be highlighted in user-specified colors (via a color selector). The molecules outside the movable cubic box are not displayed unless the box is moved into their local subspace. A multi-layer iterative data retrieval and display algorithm was employed for displaying the spheres within a local cubic box of 1/4\u003csup\u003eN-1\u003c/sup\u003e (N\u0026thinsp;=\u0026thinsp;1\u0026ndash;8) of the volume of the global cubic box defined by the input spheres. The following procedure created the global and local cubic box. A user manual for Chempack is provided in the supplementary materials.\u003c/p\u003e \u003cp\u003eIn this paper, we use Chempack to visualize 3DLSpace by inputting three numerical values as coordinates. Together with the labels previously annotated for the 1.9\u0026nbsp;million biomolecules, these coordinates serve as the input data for visualization in 3DLSpace. Different labels are represented by different colors for visual differentiation. 
Chempack software assigns distinct colors to different target types, where each target type corresponds to a specific color label: kinase (red), protease (green), nuclear receptor (purple), G protein-coupled receptor (GPCR) (cyan), and other targets (gray).\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eAcknowledgments\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors acknowledge the financial support from the National Key Research and Development Program of China (grant number: 2023YFA0913600) and Research Fund of Guangdong Province (grant number: 2024A1515011906). \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll code used in this paper is open source. Code for MolF-DAEs is available via GitHub at https://github.com/HuazhangYing/MolF-DAEs. Code for feature representation is available via GitHub at https://github.com/shenwanxiang/bidd-molmap. The dataset of 1.9 million bioactive molecules is available at https://drive.google.com/drive/folders/1oPpz3biogeAEoa9-RydxdLDBcZQd-6o6?usp=drive_link. Resources for the 3D visualization software Chempack are openly accessible. For access or further information, please contact the authors.\u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eY.T., Y.C., and C.Q. conceived the project. H.Y. and X.W. developed the methodology and performed experiments and analysis. H.Y. wrote the paper. H.Y., Y.C. and Y.T. contributed to the revision of the manuscript. \u003c/p\u003e\n\n\u003cp\u003e\u003cstrong\u003eCompeting Interests statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eOyedotun, O. 
\u0026amp; Dimililer, K. Pattern Recognition: Invariance Learning in Convolutional Auto Encoder Network. \u003cem\u003eInternational Journal of Image, Graphics and Signal Processing\u003c/em\u003e 8, 19-27 (2016). https://doi.org:10.5815/ijigsp.2016.03.03\u003c/li\u003e\n\u003cli\u003eDundar, A., Jin, J. \u0026amp; Culurciello, E. Convolutional Clustering for Unsupervised Learning. \u003cem\u003eArXiv\u003c/em\u003e abs/1511.06241 (2015). \u003c/li\u003e\n\u003cli\u003ePolanski, J. Unsupervised Learning in Drug Design from Self-Organization to Deep Chemistry. \u003cem\u003eInternational Journal of Molecular Sciences\u003c/em\u003e 23, 2797 (2022). \u003c/li\u003e\n\u003cli\u003eG\u0026oacute;mez-Bombarelli, R.\u003cem\u003e et al.\u003c/em\u003e Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules. \u003cem\u003eACS Central Science\u003c/em\u003e 4, 268-276 (2018). https://doi.org:10.1021/acscentsci.7b00572\u003c/li\u003e\n\u003cli\u003eClarke, R.\u003cem\u003e et al.\u003c/em\u003e The properties of high-dimensional data spaces: implications for exploring gene and protein expression data. \u003cem\u003eNature Reviews Cancer\u003c/em\u003e 8, 37-49 (2008). https://doi.org:10.1038/nrc2294\u003c/li\u003e\n\u003cli\u003eChen, L., Saykin, A. J., Yao, B. \u0026amp; Zhao, F. Multi-task deep autoencoder to predict Alzheimer\u0026rsquo;s disease progression using temporal DNA methylation data in peripheral blood. \u003cem\u003eComputational and Structural Biotechnology Journal\u003c/em\u003e 20, 5761-5774 (2022). https://doi.org:10.1016/j.csbj.2022.10.016\u003c/li\u003e\n\u003cli\u003eMullowney, M. W.\u003cem\u003e et al.\u003c/em\u003e Artificial intelligence for natural product drug discovery. \u003cem\u003eNature Reviews Drug Discovery\u003c/em\u003e 22, 895-916 (2023). https://doi.org:10.1038/s41573-023-00774-7\u003c/li\u003e\n\u003cli\u003eTropsha, A., Isayev, O., Varnek, A., Schneider, G. \u0026amp; Cherkasov, A. 
Integrating QSAR modelling and deep learning in drug discovery: the emergence of deep QSAR. \u003cem\u003eNature Reviews Drug Discovery\u003c/em\u003e 23, 141-155 (2024). https://doi.org:10.1038/s41573-023-00832-0\u003c/li\u003e\n\u003cli\u003eSong, C., Liu, F., Huang, Y., Wang, L. \u0026amp; Tan, T. in \u003cem\u003eProceedings, Part I, of the 18th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications - Volume 8258\u003c/em\u003e 117\u0026ndash;124 (Springer-Verlag, Havana, Cuba, 2013).\u003c/li\u003e\n\u003cli\u003eWalters, S. J. \u0026amp; Campbell, M. J. The use of bootstrap methods for analysing Health-Related Quality of Life outcomes (particularly the SF-36). \u003cem\u003eHealth Qual Life Outcomes\u003c/em\u003e 2, 70 (2004). https://doi.org:10.1186/1477-7525-2-70\u003c/li\u003e\n\u003cli\u003eEckhardt, C. M.\u003cem\u003e et al.\u003c/em\u003e Unsupervised machine learning methods and emerging applications in healthcare. \u003cem\u003eKnee Surg Sports Traumatol Arthrosc\u003c/em\u003e 31, 376-381 (2023). https://doi.org:10.1007/s00167-022-07233-7\u003c/li\u003e\n\u003cli\u003eAltman, N. \u0026amp; Krzywinski, M. Clustering. \u003cem\u003eNature Methods\u003c/em\u003e 14, 545-546 (2017). https://doi.org:10.1038/nmeth.4299\u003c/li\u003e\n\u003cli\u003eKamal, I. M. \u0026amp; Bae, H. Super-encoder with cooperative autoencoder networks. \u003cem\u003ePattern Recognition\u003c/em\u003e 126, 108562 (2022). https://doi.org:10.1016/j.patcog.2022.108562\u003c/li\u003e\n\u003cli\u003eHinton, G. E. \u0026amp; Salakhutdinov, R. R. Reducing the Dimensionality of Data with Neural Networks. \u003cem\u003eScience\u003c/em\u003e 313, 504-507 (2006). https://doi.org:10.1126/science.1127647\u003c/li\u003e\n\u003cli\u003eSteinbach, M., Ert\u0026ouml;z, L. \u0026amp; Kumar, V. 
in \u003cem\u003eNew Directions in Statistical Physics: Econophysics, Bioinformatics, and Pattern Recognition\u003c/em\u003e (ed Luc T. Wille) 273-309 (Springer Berlin Heidelberg, 2004).\u003c/li\u003e\n\u003cli\u003eYu, W., Wang, R., Nie, F. \u0026amp; Wang, F. Multi-view embedded clustering with unsupervised trace ratio LDA. \u003cem\u003eNeurocomputing\u003c/em\u003e 315, 169-176 (2018). https://doi.org:10.1016/j.neucom.2018.07.014\u003c/li\u003e\n\u003cli\u003eStumpfe, D. \u0026amp; Bajorath, J. Exploring Activity Cliffs in Medicinal Chemistry. \u003cem\u003eJournal of Medicinal Chemistry\u003c/em\u003e 55, 2932-2942 (2012). https://doi.org:10.1021/jm201706b\u003c/li\u003e\n\u003cli\u003eHan, Z.\u003cem\u003e et al.\u003c/em\u003e Mesh Convolutional Restricted Boltzmann Machines for Unsupervised Learning of Features With Structure Preservation on 3-D Meshes. \u003cem\u003eIEEE Trans Neural Netw Learn Syst\u003c/em\u003e 28, 2268-2281 (2017). https://doi.org:10.1109/tnnls.2016.2582532\u003c/li\u003e\n\u003cli\u003eGuo, X., Liu, X., Zhu, E. \u0026amp; Yin, J. 373-382 (Springer International Publishing).\u003c/li\u003e\n\u003cli\u003eZdrazil, B.\u003cem\u003e et al.\u003c/em\u003e The ChEMBL Database in 2023: a drug discovery platform spanning multiple bioactivity data types and time periods. \u003cem\u003eNucleic Acids Research\u003c/em\u003e 52, D1180-D1192 (2023). https://doi.org:10.1093/nar/gkad1004\u003c/li\u003e\n\u003cli\u003eShen, W. X.\u003cem\u003e et al.\u003c/em\u003e Out-of-the-box deep learning prediction of pharmaceutical properties by broadly learned knowledge-based molecular representations. \u003cem\u003eNature Machine Intelligence\u003c/em\u003e 3, 334-343 (2021). https://doi.org:10.1038/s42256-021-00301-6\u003c/li\u003e\n\u003cli\u003eIlnicka, A. \u0026amp; Schneider, G. Compression of molecular fingerprints with autoencoder networks. \u003cem\u003eMolecular Informatics\u003c/em\u003e 42, 2300059 (2023). 
https://doi.org:10.1002/minf.202300059\u003c/li\u003e\n\u003cli\u003eMoon, K. R.\u003cem\u003e et al.\u003c/em\u003e Visualizing structure and transitions in high-dimensional biological data. \u003cem\u003eNature Biotechnology\u003c/em\u003e 37, 1482-1492 (2019). https://doi.org:10.1038/s41587-019-0336-3\u003c/li\u003e\n\u003cli\u003eMcInnes, L. \u0026amp; Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. \u003cem\u003eArXiv\u003c/em\u003e abs/1802.03426 (2018). \u003c/li\u003e\n\u003cli\u003eLi, Y. H.\u003cem\u003e et al.\u003c/em\u003e Clinical trials, progression-speed differentiating features and swiftness rule of the innovative targets of first-in-class drugs. \u003cem\u003eBriefings in Bioinformatics\u003c/em\u003e 21, 649-662 (2019). https://doi.org:10.1093/bib/bby130\u003c/li\u003e\n\u003cli\u003eVasaikar, S., Bhatia, P., Bhatia, P. \u0026amp; Yaiw, K.-C. Complementary Approaches to Existing Target Based Drug Discovery for Identifying Novel Drug Targets. \u003cem\u003eBiomedicines\u003c/em\u003e 4, 27 (2016). https://doi.org:10.3390/biomedicines4040027\u003c/li\u003e\n\u003cli\u003eAttwood, M. M., Fabbro, D., Sokolov, A. V., Knapp, S. \u0026amp; Schi\u0026ouml;th, H. B. Trends in kinase drug discovery: targets, indications and inhibitor design. \u003cem\u003eNature Reviews Drug Discovery\u003c/em\u003e 20, 839-861 (2021). https://doi.org:10.1038/s41573-021-00252-y\u003c/li\u003e\n\u003cli\u003eKumar, V.\u003cem\u003e et al.\u003c/em\u003e Role of Tyrosine Kinases and their Inhibitors in Cancer Therapy: A Comprehensive Review. \u003cem\u003eCurr Med Chem\u003c/em\u003e 30, 1464-1481 (2023). https://doi.org:10.2174/0929867329666220727122952\u003c/li\u003e\n\u003cli\u003eShaban, N., Kamashev, D., Emelianova, A. \u0026amp; Buzdin, A. Targeted Inhibitors of EGFR: Structure, Biology, Biomarkers, and Clinical Applications. \u003cem\u003eCells\u003c/em\u003e 13 (2023). 
https://doi.org:10.3390/cells13010047\u003c/li\u003e\n\u003cli\u003eSantiago, A. D. S.\u003cem\u003e et al.\u003c/em\u003e Structural Analysis of Inhibitor Binding to CAMKK1 Identifies Features Necessary for Design of Specific Inhibitors. \u003cem\u003eSci Rep\u003c/em\u003e 8, 14800 (2018). https://doi.org:10.1038/s41598-018-33043-4\u003c/li\u003e\n\u003cli\u003eProfeta, G. S.\u003cem\u003e et al.\u003c/em\u003e Binding and structural analyses of potent inhibitors of the human Ca(2+)/calmodulin dependent protein kinase kinase 2 (CAMKK2) identified from a collection of commercially-available kinase inhibitors. \u003cem\u003eSci Rep\u003c/em\u003e 9, 16452 (2019). https://doi.org:10.1038/s41598-019-52795-1\u003c/li\u003e\n\u003cli\u003eBickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. \u0026amp; Hopkins, A. L. Quantifying the chemical beauty of drugs. \u003cem\u003eNature Chemistry\u003c/em\u003e 4, 90-98 (2012). https://doi.org:10.1038/nchem.1243\u003c/li\u003e\n\u003cli\u003eBondensgaard, K.\u003cem\u003e et al.\u003c/em\u003e Recognition of Privileged Structures by G-Protein Coupled Receptors. \u003cem\u003eJournal of Medicinal Chemistry\u003c/em\u003e 47, 888-899 (2004). https://doi.org:10.1021/jm0309452\u003c/li\u003e\n\u003cli\u003eLiao, J. J.-L. Molecular Recognition of Protein Kinase Binding Pockets for Design of Potent and Selective Kinase Inhibitors. \u003cem\u003eJournal of Medicinal Chemistry\u003c/em\u003e 50, 409-424 (2007). https://doi.org:10.1021/jm0608107\u003c/li\u003e\n\u003cli\u003evan Linden, O. P. J., Kooistra, A. J., Leurs, R., de Esch, I. J. P. \u0026amp; de Graaf, C. KLIFS: A Knowledge-Based Structural Database To Navigate Kinase\u0026ndash;Ligand Interaction Space. \u003cem\u003eJournal of Medicinal Chemistry\u003c/em\u003e 57, 249-277 (2014). https://doi.org:10.1021/jm400378w\u003c/li\u003e\n\u003cli\u003eAronov, A. M., McClain, B., Moody, C. S. \u0026amp; Murcko, M. A. 
Kinase-likeness and Kinase-Privileged Fragments: Toward Virtual Polypharmacology. \u003cem\u003eJournal of Medicinal Chemistry\u003c/em\u003e 51, 1214-1222 (2008). https://doi.org:10.1021/jm701021b\u003c/li\u003e\n\u003cli\u003eSchneider, P. \u0026amp; Schneider, G. Privileged Structures Revisited. \u003cem\u003eAngewandte Chemie International Edition\u003c/em\u003e 56, 7971-7974 (2017). https://doi.org:10.1002/anie.201702816\u003c/li\u003e\n\u003cli\u003eIlnicka, A. \u0026amp; Schneider, G. Designing molecules with autoencoder networks. \u003cem\u003eNature Computational Science\u003c/em\u003e 3, 922-933 (2023). https://doi.org:10.1038/s43588-023-00548-6\u003c/li\u003e\n\u003cli\u003eGebru, T.\u003cem\u003e et al.\u003c/em\u003e Using deep learning and Google Street View to estimate the demographic makeup of neighborhoods across the United States. \u003cem\u003eProceedings of the National Academy of Sciences\u003c/em\u003e 114, 13108-13113 (2017). https://doi.org:10.1073/pnas.1700035114\u003c/li\u003e\n\u003cli\u003eSilver, D.\u003cem\u003e et al.\u003c/em\u003e Mastering the game of Go without human knowledge. \u003cem\u003eNature\u003c/em\u003e 550, 354-359 (2017). https://doi.org:10.1038/nature24270\u003c/li\u003e\n\u003cli\u003eLu, S. \u0026amp; Li, R. DAC: Deep Autoencoder-based Clustering, a General Deep Learning Framework of Representation Learning. \u003cem\u003eArXiv\u003c/em\u003e abs/2102.07472 (2021). \u003c/li\u003e\n\u003cli\u003eRen, Y.\u003cem\u003e et al.\u003c/em\u003e Deep Clustering: A Comprehensive Survey. \u003cem\u003eIEEE Transactions on Neural Networks and Learning Systems\u003c/em\u003e, 1-21 (2024). https://doi.org:10.1109/TNNLS.2024.3403155\u003c/li\u003e\n\u003cli\u003eXie, J., Girshick, R. \u0026amp; Farhadi, A. in \u003cem\u003eProceedings of The 33rd International Conference on Machine Learning\u003c/em\u003e Vol. 48 (eds Balcan Maria Florina \u0026amp; Q. 
Weinberger Kilian) 478\u0026ndash;487 (PMLR, Proceedings of Machine Learning Research, 2016).\u003c/li\u003e\n\u003cli\u003eMendez, D.\u003cem\u003e et al.\u003c/em\u003e ChEMBL: towards direct deposition of bioassay data. \u003cem\u003eNucleic Acids Res\u003c/em\u003e 47, D930-D940 (2019). https://doi.org:10.1093/nar/gky1075\u003c/li\u003e\n\u003cli\u003eGaulton, A.\u003cem\u003e et al.\u003c/em\u003e The ChEMBL database in 2017. \u003cem\u003eNucleic Acids Res\u003c/em\u003e 45, D945-D954 (2017). https://doi.org:10.1093/nar/gkw1074\u003c/li\u003e\n\u003cli\u003eYang, K.\u003cem\u003e et al.\u003c/em\u003e Analyzing Learned Molecular Representations for Property Prediction. \u003cem\u003eJournal of Chemical Information and Modeling\u003c/em\u003e 59, 3370-3388 (2019). https://doi.org:10.1021/acs.jcim.9b00237\u003c/li\u003e\n\u003cli\u003eShen, W. X.\u003cem\u003e et al.\u003c/em\u003e AggMapNet: enhanced and explainable low-sample omics deep learning with feature-aggregated multi-channel networks. \u003cem\u003eNucleic Acids Research\u003c/em\u003e 50, e45-e45 (2022). https://doi.org:10.1093/nar/gkac010\u003c/li\u003e\n\u003cli\u003eDong, J.\u003cem\u003e et al.\u003c/em\u003e PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. \u003cem\u003eJ Cheminform\u003c/em\u003e 10, 16 (2018). https://doi.org:10.1186/s13321-018-0270-2\u003c/li\u003e\n\u003cli\u003e\u003cem\u003eProceedings of the 24th International Conference on Neural Information Processing Systems\u003c/em\u003e. (Curran Associates Inc., 2011).\u003c/li\u003e\n\u003cli\u003eKanev, G. K., de Graaf, C., Westerman, B. A., de Esch, I. J. P. \u0026amp; Kooistra, A. J. KLIFS: an overhaul after the first 5 years of supporting kinase research. \u003cem\u003eNucleic Acids Research\u003c/em\u003e 49, D562-D569 (2020). 
https://doi.org:10.1093/nar/gkaa895\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"nature-portfolio","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"","title":"Nature Portfolio","twitterHandle":"","acdcEnabled":false,"dfaEnabled":false,"editorialSystem":"ejp","reportingPortfolio":"","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Bioactive molecule, DAEs, substructure","lastPublishedDoi":"10.21203/rs.3.rs-6755378/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6755378/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eUnsupervised deep autoencoders (DAEs) are useful for data clustering and visualization. DAE-derived data clusters are typically visualized by dimensionality reduction methods, which have some degree of visual distortions that pose difficulties in revealing intrinsic cluster patterns. Here, we developed substructure-based molecular-fingerprint DAEs (MolF-DAEs) to cluster 1.9\u0026nbsp;million bioactive molecules (biomolecules) in 3D latent space (3DLSpace), where data clusters can be straightforwardly visualized. MolF-DAEs developed with three established sets of molecular fingerprints consistently cluster biomolecules with 96.1\u0026ndash;97.6% reconstruction rate. In 3DLSpace, the biomolecules cluster into novel substructure-distinguished bioactivity-relevant band-shaped clusters. 
Each cluster is dominated by the biomolecules of specific substructure combinations. These in-cluster biomolecules are of varying molecular structures but frequently form a limited number of bioactivity classes. Our study suggests that unsupervised deep clustering in 3DLSpace is useful for visually revealing the intrinsic data distribution patterns and functionally relevant data clusters.\u003c/p\u003e","manuscriptTitle":"Molecular-substructure Deep Autoencoders Cluster Biomolecules into Novel Band-Shaped Substructure-Distinguished Bioactivity Clusters in 3D Latent Space","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-06-16 12:01:52","doi":"10.21203/rs.3.rs-6755378/v1","editorialEvents":[],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"communications-chemistry","isNatureJournal":true,"hasQc":false,"allowDirectSubmit":false,"externalIdentity":"commschem","sideBox":"Learn more about [Communications Chemistry](http://www.nature.com/commschem/)","snPcode":"","submissionUrl":"","title":"Communications Chemistry","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"ejp","reportingPortfolio":"Communications Series","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"dd4b7ff3-f68d-44ec-a053-4278ad9eeb7c","owner":[],"postedDate":"June 16th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":49953660,"name":"Biological sciences/Chemical biology/Cheminformatics"},{"id":49953661,"name":"Biological sciences/Chemical biology/Computational chemistry"}],"tags":[],"updatedAt":"2025-07-11T12:01:47+00:00","versionOfRecord":[],"versionCreatedAt":"2025-06-16 
12:01:52","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6755378","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6755378","identity":"rs-6755378","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
