Using Quantum Atomics and Machine Learning to Advance Picotechnology

Research Article

Preston J. MacDougall, Kiran K. Donthula

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-4669576/v1. This work is licensed under a CC BY 4.0 License. Status: Published. Journal publication published 26 Sep, 2024 in Theoretical Chemistry Accounts. Version 1.

Abstract

We explore the use of machine learning to predict spectroscopic properties and interaction energies of the carbonyl groups in 225 ketones, aldehydes, imides, and amides.
In the combined spirit of Density Functional Theory (DFT) and the Quantum Theory of Atoms in Molecules (QTAIM), but with an eye toward eventually using databases of transferable fragment densities, we limit the training data to small sets of descriptors (from 18 to 48 per molecule) that are based on topological features in the total charge density, ρ, and/or its Laplacian, ∇²ρ. We obtain a mean absolute error under 1% for carbonyl stretching frequencies, and just over 1% for C-13 NMR shifts. Predicting interaction energies with a model nucleophile (fluoride ion) is significantly more challenging. Mean absolute errors just over 3 kcal/mol were obtained for covalent bond formation energies. Similar mean absolute errors were obtained for much weaker van der Waals interaction energies. We also conducted a stress-test to see if our small molecule-based machine learning could predict covalent bond formation energy in a model of the active site of the E. coli enzyme, D-fructose-6-phosphate aldolase.

Keywords: QTAIM, charge density topology, picotechnology, quantum atomics, machine learning

WHAT IS PICOTECHNOLOGY?

Nanotechnology is the scientific investigation of matter at the nanometer scale (1 nm = 10⁻⁹ m), roughly from one to hundreds of nanometers, and engineering systems that take advantage of the properties of matter at that scale. Drug-like organic molecules have dimensions measuring up to just a few nanometers. So, dating back to the mid-1800s, organic chemistry is the oldest form of nanotechnology. Similarly, but a century or so younger, femtotechnology corresponds to the scientific investigation of matter at the femtometer scale (1 fm = 10⁻¹⁵ m) and engineering systems that take advantage of the properties of matter at that scale. It is better known as nuclear engineering.
Any future picotechnology should therefore be defined as the scientific investigation of matter at the picometer scale (1 pm = 10⁻¹² m), from one to hundreds of picometers, and engineering systems that take advantage of the properties of matter at that scale. In other words, atomic engineering! Since most atoms have diameters over 100 pm, picotechnology is necessarily concerned with subatomic features (outside the nucleus). Chemists routinely refer to such features conceptually, such as when using arrow-pushing diagrams to show a nonbonding electron pair on one atom filling an unoccupied atomic orbital on another atom, forming a new covalent bond.

Technologies, however, are not purely conceptual. They also have physical manifestations. Physical, reproducible, and chemically significant subatomic features have been identified in the topology of the Laplacian of the charge density, ∇²ρ, for atoms from all across the periodic table, and in many forms of matter, by computational and experimental researchers alike [1-4]. Pioneering chemical informatics work by Alsberg et al. noted that “The (QTAIM) approach is well suited for machine inference as it is topologically based and close to the naïve atom-bond current ILP (inductive logic programming) representation.” [5]. Concurrently, Popelier et al. began developing Quantum Topological Molecular Similarity (QTMS), using charge density topological data (at bond critical points) to predict a variety of empirical chemical parameters, such as Hammett constants for benzoic acid derivatives [6]. These and similar studies motivated us to consider whether other charge density descriptors could prove useful in training neural networks to predict more localized and directional chemical properties, such as mapping van der Waals interaction energies, which are key to drug design and modelling biomolecular interactions.
For our neural network training data, in addition to the bond critical point data that others have used, we also include topological features in ∇²ρ, specifically within the valence-shell charge concentration (VSCC) of carbon atoms in carbonyl groups (Fig. 2). Such subatomic topological features have previously been shown to predict Bürgi-Dunitz angles of nucleophilic attack [7], and even changes to that angle upon protonation of the carbonyl oxygen [1]. However, like all other wavefunction-based quantum chemical modelling of chemical reactivity, the topological analysis of localized sites of attack first required time-consuming Coulomb and exchange-correlation integrals spanning the entire molecule. Even using DFT, which omits the latter and most time-consuming integrals, if the carbonyl group of interest is part of a very large protein complex, then most computation time is wasted on computing the charge density around atoms unrelated to the reactive site. QM/MM embedding methods such as ONIOM [8] can further economize by focusing high-level computing effort on the reactive site, but that is not the objective of the current work.

Our goal is to investigate the feasibility of training neural networks to predict carbonyl reactivity using descriptors based on topological properties of subatomic features within the VSCC of carbonyl carbon atoms (measured in picometers), regardless of how the charge density is computed. It could, in fact, be measured with high-resolution, single-crystal X-ray diffraction [4]. Ultimately, because the charge densities of functional groups can be very transferable between molecules [2] (even when they differ greatly on the periphery), we expect that in the future picotechnology that we envision, the necessary subatomic topological descriptors will be very rapidly and accurately constructed from fragment databases [9].
Thus, when picotechnology is mature, no quantum chemical calculations or high-resolution X-ray diffraction experiments will be required to predict chemical reactivity at the reactive site of a molecule.

COMPUTATIONAL METHODS

2.1 Quantum Chemistry

The initial Cartesian coordinates of 225 carbonyl compounds, including aldehydes, ketones, and amides, were obtained using the molecular modelling software Spartan’10 [10]. The molecules used in this study range in size from 7 to 31 atoms. A complete list of compounds included in the training set can be found in Supplementary Materials (Table 1). Full geometry optimization (of default Spartan-generated conformers) and generation of the wavefunction files were carried out using Gaussian09 [11]. All molecules were optimized at the B3LYP/6-31+G* level, and wavefunctions were obtained at the M05-2X/6-311++G** level of theory [12-16]. The wavefunctions obtained from these calculations were used to calculate topological properties of the molecules, such as the electron density, the Laplacian of the electron density, Hessian eigenvalues (of the corresponding Hessian matrix, of ρ or ∇²ρ), bond critical points, and critical points in ∇²ρ. The topological properties of the electron density at bond critical points were determined using AIMAll [17], whereas the topological properties at critical points of the Laplacian of the charge density were determined using Denprop [18]. The topological data used for descriptors in neural network training were extracted using a Python script to create the data sets. The types of critical points that are located and used as training data in this research are illustrated in Figures 1 and 2.

2.2 Data Sets and Quantum Atomics

Quantum Atomics refers to topological and/or integrated atomic data obtained via, or within the context of, the Quantum Theory of Atoms in Molecules (QTAIM). Numerous computational and experimental groups worldwide have been collecting such data for decades, without referring to it as such.
It is perhaps a useful term for distinguishing such data from other computational chemistry and crystallographic parameters that are routinely used, but lack the rigorous foundation of QTAIM. A second over-arching goal of this research is to explore the information content “quality” of various types of Quantum Atomic data for specific machine learning goals. For instance, if the goal is to predict chemical reactivity, should training data include topological properties of the charge density, or the Laplacian of the charge density? Although we have not explored such questions here, future research could address whether more efficient and accurate machine learning can be achieved with specific integrated atomic properties that are defined uniquely within QTAIM, such as atomic energy, magnetic susceptibility, electric polarizability, multipole moments, etc.

In this research, a total of nine separate data sets were used: three types of topological data (BCP, LCP, and combined) were each paired with three types of physicochemical data (C-13 NMR chemical shifts, C=O stretching frequencies, and nucleophile interaction energies). Each data set contains critical point descriptors for all 225 molecules and was used to train artificial neural networks (ANNs) to predict C-13 chemical shifts, C=O stretching frequencies, or interaction energies with a model nucleophile (the fluoride ion). Among the nine data sets there are therefore three of each of the following: bond critical point (BCP), Laplacian critical point (LCP), and combined data sets (with both BCP and LCP data). In addition to the descriptors, each data set contains a class label: the experimental C-13 shift, the experimental C=O stretching frequency, or the theoretical interaction energy. The experimental C-13 chemical shifts (in ppm) and C=O stretching frequencies (in cm⁻¹) were collected from the Spectral Database for Organic Compounds (SDBS) [19]. All topological data in the training sets can be found in Supplementary Materials (Table 2).
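The pairing of three descriptor types with three targets described above can be sketched in a few lines of Python. This is only an illustrative layout, not the extraction script used in this work. The count of six values per critical point is an assumption (it is consistent with 18 descriptors for 3 BCPs and 48 for 3 BCPs plus 5 LCPs), and all function names, dictionary keys, and the random placeholder values are hypothetical.

```python
import random

N_BCP, N_LCP = 3, 5   # critical points per molecule, as stated in the text
N_FEATURES = 6        # ASSUMPTION: six values per critical point
DESCRIPTOR_TYPES = ("BCP", "LCP", "combined")
TARGETS = ("c13_shift", "co_stretch", "interaction_energy")  # hypothetical keys


def flatten(rows, expected_rows):
    """Flatten a table of critical-point rows into one descriptor vector."""
    if len(rows) != expected_rows or any(len(r) != N_FEATURES for r in rows):
        raise ValueError("unexpected critical-point table shape")
    return [float(v) for row in rows for v in row]


def descriptor_vectors(molecule):
    """Return the BCP, LCP, and combined descriptor vectors for one molecule."""
    bcp = flatten(molecule["bcps"], N_BCP)
    lcp = flatten(molecule["lcps"], N_LCP)
    return {"BCP": bcp, "LCP": lcp, "combined": bcp + lcp}


def build_datasets(molecules):
    """Pair every descriptor type with every target: 3 x 3 = 9 (X, y) sets."""
    vectors = [descriptor_vectors(m) for m in molecules]
    return {
        (dtype, target): ([v[dtype] for v in vectors],
                          [float(m[target]) for m in molecules])
        for dtype in DESCRIPTOR_TYPES
        for target in TARGETS
    }


def placeholder_molecules(n, seed=0):
    """Random placeholder records; these are NOT real QTAIM or spectral data."""
    rng = random.Random(seed)

    def table(rows):
        return [[rng.uniform(-1.0, 1.0) for _ in range(N_FEATURES)]
                for _ in range(rows)]

    return [{"bcps": table(N_BCP), "lcps": table(N_LCP),
             **{t: rng.uniform(-1.0, 1.0) for t in TARGETS}}
            for _ in range(n)]
```

Calling `build_datasets(placeholder_molecules(4))` yields nine entries keyed by (descriptor type, target), with descriptor vectors of length 18, 30, or 48. Keeping the three descriptor blocks separate until the final concatenation makes it straightforward to compare BCP-only, LCP-only, and combined training data on equal footing.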
An example of the LCP portion of such a data set for 1-butanal is shown in Table 1.

Table 1. Sample Laplacian critical point (LCP) input data for 2-chloro-4-fluorobenzaldehyde (all units au)

Distance from C nucleus   λ1       λ2       λ3       ρ       ∇²ρ
1.0282                    -1.140   -0.614   10.800   0.141    0.0294
1.0282                    -1.140   -0.614   10.800   0.141    0.0294
0.9701                     4.300    5.060   22.000   0.292   -1.0100
0.9790                     6.480    6.900   18.800   0.302   -1.2300
0.9811                     5.230    8.960   18.000   0.431   -1.0700

The interaction energy, $\Delta E_{\text{interaction}}$, between each carbonyl compound and a fluoride ion (F⁻) in the nucleophilic addition reaction was calculated at the B3LYP/6-31+G* level as

\[ \Delta E_{\text{interaction}} = E_{\text{CC+F}} - E_{\text{CC}} - E_{\text{F}} \quad \text{(Eqn. 1)} \]

where $E_{\text{CC+F}}$, $E_{\text{CC}}$, and $E_{\text{F}}$ are the total energies of the carbanion complex, the carbonyl compound, and the fluoride ion nucleophile, respectively. Basis set superposition errors (BSSE) were not calculated, as they would have considerably increased computational time, while yielding no significant value. The machine learning here is based on properties of the electron density, which are completely unaffected by BSSE.

2.3 Artificial Neural Network Model

Over the past few decades, artificial neural networks (ANNs) have had huge success in machine learning and data mining applications [20]. Recently, Handley and Popelier have pioneered the use of QTAIM data (atomic multipole moments) in conjunction with machine learning to model the fluctuating polarizability of water molecules in molecular dynamics simulations [21]. ANNs are powerful tools in advanced computing that analyse information quantitatively by learning from training data. The important properties of ANNs are the ability of a network to learn from its environment and to improve its performance by learning. A learning algorithm is a procedure in which learning rules are used to adjust the weights. An ANN consists of an input layer, one or more hidden layers, and an output layer.
The input signal propagates layer by layer in the forward direction, and these networks are commonly called multilayer perceptrons (MLPs) [22,23]. The MLP with the back-propagation (BP) learning method is one of the most successfully used methods in chemistry and drug design because of its well-defined and explicit set of equations for weight corrections [24]. This is a supervised learning algorithm, in which the network is trained with training data for which the expected outputs are provided. The learning consists of a forward pass and a backward pass. In the forward pass, the input vector is applied to the input layer; these input values are modified by a fixed weight, and the effect passes through the network layer by layer. Finally, the output produced by the network is compared with the desired output to calculate the error signal. This error signal is then back-propagated in the backward pass to adjust the weights in such a way that the actual output value moves closer to the desired output value, according to the error correction rule [22].

In this study, the machine learning package WEKA 3.6.13 was used for the ANN model development [25]. In our model, we used one hidden layer with 9 neurons when it was trained with bond critical point data, and one hidden layer with 15 neurons when it was trained with Laplacian critical point data. The predictive ability of the network is determined by validation techniques. We used the leave-one-out cross-validation technique, in which the whole data set was divided into 225 pieces, with 224 pieces used for training and one piece for testing. The mean absolute percentage errors were calculated by comparing predicted values with actual values.

Proof-of-concept Spectroscopic Predictions

As a proof-of-concept exercise, we first attempted to predict routinely measured infrared stretching frequencies of the carbonyl bonds and the ¹³C chemical shifts of the carbonyl carbons.
Such properties can be accurately approximated by semi-empirical wavefunction-based methods, but their prediction with spectroscopic precision using true ab initio methods is still quite challenging. Development of alternative and highly efficient methodologies for predicting even these properties is still of interest. Table 2 presents a summary of results using optimized machine learning parameters within WEKA. Default parameters yield results of significantly lower quality. While nothing close to spectroscopic accuracy, the results are promising considering that data for only 3 BCPs and 5 LCPs were used for each compound. Predictions for individual carbonyl types, using default and optimized parameters (learning rate, number of epochs and hidden neurons), as well as the optimized ANN parameters themselves, can be found in Supplementary Materials (Tables 3, 4, and 5, respectively). For instance, the Mean Absolute Percentage Error (MAPE) in predicting C=O stretching frequency for the 100+ ketones only was 0.41% (better than results for all carbonyl types, Table 2).

It is interesting to note that, as expected, the combined (BCP and LCP) data set yields better predictions, on average. And while there is an insignificant difference between the BCP and LCP data sets in predicting C=O stretching frequencies, the topological properties of the Laplacian of the charge density appear to be “better” training data for predicting chemical shifts. This, too, may not be unexpected, since the NMR chemical shift of a nucleus depends on shielding over the entire volume of the carbon atom, not just in the σ-plane, where all the BCPs are located.

Table 2.
MAPE (%) of predicted values of C-13 chemical shifts and C=O stretching frequencies for 225 carbonyl compounds

Data set                        C-13 chemical shift   C=O stretching frequency
Laplacian critical point data   1.335                 0.659
Bond critical point data        1.596                 0.652
Combined data                   1.308                 0.641

Nucleophilic Interaction Energy Predictions

Here, we report the prediction of both covalent and van der Waals interaction energies using ANNs trained on ρ and/or ∇²ρ critical point descriptors. A nucleophilic addition reaction between a fluoride ion and a carbonyl group was taken as a model chemical reaction for our investigation of machine learning of chemical reactivity. As shown in Figure 3, preliminary investigations of reaction energy profiles, at multiple levels of theory, between a fluoride ion and acetone revealed dual minima: one at a short C-F distance (“covalent” bond formation), and one at a long C-F distance (van der Waals interaction). The interaction energies (Eqn. 1) were calculated for both strong (covalent bond formation) and weak (van der Waals) interactions for our set of 225 carbonyl-containing molecules. Our ANN was then trained on this data, using BCP, LCP, and combined data sets, as before. A summary of Mean Absolute Error (MAE) results for leave-one-out cross-validation machine learning predictions of interaction energies is shown in Table 3.

Table 3. MAE of predicted interaction energies with optimum parameters for training the ANN model

Data set                        Covalent interaction energy (kcal/mol)   van der Waals interaction energy (kcal/mol)
Laplacian critical point data   2.63                                     5.06
Bond critical point data        3.39                                     4.80
Combined data                   2.56                                     4.78

The ANN model yields much better predictions for covalent, as opposed to van der Waals, interaction energies. This is an unsurprising result, since machine learning is highly dependent on the quality of the training data, and the B3LYP functional is well known to poorly reproduce the dispersion energy, which is critical to van der Waals interactions [26].
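The leave-one-out protocol behind error summaries such as those in Tables 2 and 3 can be sketched as follows. This is a minimal illustration, not the WEKA workflow used in this work: the nearest-neighbour predictor is a deliberately simple stand-in for the ANN, and all function names are hypothetical.

```python
import math


def mean_absolute_error(actual, predicted):
    """MAE, in the same units as the target (e.g. kcal/mol)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_absolute_percentage_error(actual, predicted):
    """MAPE in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a)
                       for a, p in zip(actual, predicted)) / len(actual)


def nearest_neighbour_predict(train_X, train_y, x):
    """Stand-in model (NOT the ANN): target of the closest training row."""
    closest = min(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return train_y[closest]


def leave_one_out_predictions(X, y, predict=nearest_neighbour_predict):
    """Predict each sample from a model that sees all the other samples."""
    predictions = []
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # hold out sample i
        train_y = y[:i] + y[i + 1:]
        predictions.append(predict(train_X, train_y, X[i]))
    return predictions
```

With 225 molecules, the loop trains on 224 samples and tests on the remaining one, 225 times; any model with the same `predict(train_X, train_y, x)` shape could be swapped in for the stand-in.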
Since DFT has become a standard workhorse in molecular modelling, future research in machine learning based on charge density descriptors that have been derived from more generally reliable functionals is warranted.

With regard to the covalent interaction energy predictions of the optimized ANN model, MAEs are above the desired chemical accuracy that can be achieved easily with inexpensive empirical methods, but they support the idea that there is very high-quality information content in just a small number of charge density descriptors. We note in particular that LCP data is significantly better at predicting covalent interaction energies than the commonly used BCP data. This should not be surprising, because although the topological properties of the charge density led to the development of QTAIM in the first place, it is the topological properties of the Laplacian of the charge density that have been more useful in subsequent [27], as well as ongoing, studies of chemical reactivity based on the charge density [28]. This is perhaps the most salient observation from the current investigation: not only is the topology of the Laplacian of the charge density more complex than that of the electron density (many more critical points), but the information content at these critical points is higher when the objective is to predict chemical reactivity. Predictions for individual carbonyl types, using default and optimized parameters (learning rate, number of epochs and hidden neurons), as well as the optimized ANN parameters themselves, can be found in Supplementary Materials (Tables 6, 7, and 8, respectively).

Stress-Test: Predicting Interaction Energies of Substrates in an Enzyme’s Active Site

The accuracy of our ANN model predictions of the spectroscopic properties and chemical reactivity of the 225 carbonyl-containing compounds studied in this research is not competitive with standard quantum chemical or even semi-empirical methods.
And our ANN model is based on training data that is derived from the aforementioned quantum chemical methods. However, if developed further, the type of ANN model that we have explored will become competitive when the chemical systems become too big to do rapid conventional computational chemistry of any kind. Our ANN model depends only on topological properties of the charge density, which is measurable, and only at the site of interaction. While it is currently necessary to have very high-resolution single-crystal X-ray diffraction data [4] to determine the topological properties of the charge density that our ANN model is trained with, progress is being made in Quantum Crystallographic databases [29] that would enable very rapid approximation of the topological properties required to train or apply our ANN model, for a molecule of any size, as long as the nuclear coordinates are known, either by low-resolution structure determination methods, or by rudimentary classical molecular mechanics.

With a leap of faith into this hopeful future, we tested the accuracy of our small molecule-trained ANN model in a stress-test involving nucleophilic addition to a natural carbonyl-containing substrate of the E. coli enzyme, D-fructose-6-phosphate aldolase (FSA), which catalyzes such a nucleophilic addition, and which has a known molecular structure with a bound substrate (glycerol) [30]. We replaced glycerol with 3-hydroxypropanal (3HP), to which nucleophilic addition is naturally catalyzed by FSA, retaining hydrogen bonding contacts between the hydroxyl group and residues Asn28 and Asp6 (Residues 4 and 5, respectively, in Fig. 4). We then optimized the substrate in a fixed binding pocket. The ∇²ρ = 0.0 isosurface for the Laplacian of the charge density for 3HP in the binding pocket is shown in Figure 4. Many π-holes are evident, such as those in the VSCC of carbons in the aromatic ring of tyrosine (Residue 2 in Fig.
4), which measure about 20 picometers in diameter, or less, if there is a π-hole at all. As expected, the largest π-hole is on the carbonyl carbon of 3HP (near the center of the cluster in Figure 4, bottom), which, after all, is the target of the nucleophilic attack. Its π-hole measures approximately 50 picometers in diameter, and is also shown in a cross-sectional view in Figure 5, both before and after covalent bond formation. These well-defined, observable, and often transferable subatomic features, which are made evident by the Laplacian of the charge density, are clearly related to the chemical properties of the respective atoms. Data associated with these features, such as their sizes, locations, and topological properties, all fall under the label of Quantum Atomics. Their size, in particular, also serves to illustrate and justify at least the prefix of the term picotechnology.

Eqn. 2 was used to calculate interaction energies quantum mechanically, at the same level of theory as before. In addition to 3HP, the fluoride nucleophile was optimized alone in the fixed pocket, as were the combined reactants. All charge density analysis was done as before (with the 225 carbonyl-containing compounds). Only a covalent interaction energy was determined, as there was not enough room in the pocket for a van der Waals complex to form between the fluoride ion and 3HP. The same ANN model that was optimized for the 225 carbonyl-containing compounds was used to predict the covalent interaction energy, using both the BCP and LCP data, as well as the combined data set. The results of our quantum chemical calculations and the ANN model predictions are summarized in Table 4.

\[ \Delta E_{\text{interaction}} = E_{\text{pocket+reactants(CC+F)}} - E_{\text{pocket+CC}} - E_{\text{pocket+F}} + E_{\text{pocket}} \quad \text{(Eqn. 2)} \]

Table 4.
Comparison of covalent ΔE_interaction in the binding pocket of the FSA enzyme (kcal/mol)

Method                     ΔE_interaction (kcal/mol)
DFT (B3LYP/6-31+G*)        -22.7
ANN model, BCP data        -11.9
ANN model, LCP data        -19.7
ANN model, combined data   -14.9

As in the case of using the ANN model to predict interaction energies of the 225 carbonyl-containing compounds, training data based on the topological properties of the Laplacian of the charge density led to more accurate predictions of chemical reactivity. We note that when LCP data is used, the 3.0 kcal/mol absolute error in the prediction of the covalent interaction energy of this large and complex enzyme model is about the same as the Mean Absolute Error in covalent interaction energies for smaller and simpler carbonyl compounds, even though no additional training was done.

A very recent study combining charge density analysis with machine learning sought to investigate the inhibition mechanism of cruzain, a cysteine protease that is key to the life cycle of the parasite responsible for Chagas disease [31]. Its charge density-based descriptors were limited to BCP training data. Our research indicates that such studies could be greatly improved by including Laplacian-based descriptors among the training data.

CONCLUSIONS AND OUTLOOK

The rapid advancement of machine learning technology is matched by the growing range of applications to which it is being applied. But like any other advanced technology, it is not a one-size-fits-all tool. We already know that the type and quality of training data is as important as the method by which training occurs. The research presented here reinforces this. Additionally, the converse of this principle is that machine learning can also be used as a tool to empirically explore the information content of different properties of matter that are accessible but not yet fully understood. When it comes to machine learning applications in materials science and engineering, nothing that is measurable is more fundamental than the charge density.
If we completely understood the forces shaping the charge distribution in matter, functional development in Density Functional Theory would be a straightforward matter. It is not. The research presented here has offered insight into the types of applications for which different properties of the charge density are better suited as training data. Simultaneous with advancements in machine learning, developments in both the theory and practice of Quantum Crystallography [32] make measurement of the charge density and its derivative properties ever more accessible and accurate.

Advancement of nanotechnology has ultimately been dependent on engineering molecular structure and reactivity [33], which chemists achieve via substitutions of one type or another, at the atomic or functional group level. Again, this is not straightforward, because there is incomplete understanding of how atomic substitutions determine changes in molecular properties. And that is the long-term goal of this research: to advance a picotechnology by combining the data-mining and property-prediction power of machine learning with the ever-increasing understanding of the charge distribution, with subatomic resolution.

Declarations

Author Contribution

PJM wrote the main manuscript text and KKD performed the calculations and prepared figures 1-5. All authors reviewed the manuscript.

Acknowledgement

Financial support is acknowledged from the Office of Science, U.S. Department of Energy (DE-SC0005094).

Data Availability

Data is provided within the manuscript or supplementary information files.

References

1. Bader RFW, MacDougall PJ, Lau CDH (1984) J Amer Chem Soc 106:1594–1605.
2. Bader RFW (1990) Atoms in Molecules: A Quantum Theory. Clarendon Press, Oxford.
3. MacDougall PJ, Henze CE (2007) In: Matta CF, Boyd RJ (eds) The Quantum Theory of Atoms in Molecules: From Solid State to DNA and Drug Design. Wiley-VCH, Weinheim.
4. Coppens P, Koritsanszky T (2001) Chem Rev 101:1583–1627.
5. King RD, Marchand-Geneste N, Alsberg BK (2001) Electronic Transactions on Artificial Intelligence 5B:127–142.
6. Popelier PLA, O’Brien SE (2001) J Chem Inf Comput Sci 41:764–775; Popelier PLA, Smith PJ (2006) Eur J Med Chem 41:862–873.
7. Bürgi HB, Dunitz JD, Lehn JM, Wipff G (1974) Tetrahedron 30:1563–1572.
8. Dapprich S, Komaromi I, Byun KS, Morokuma K, Frisch MJ (1999) J Mol Struct (THEOCHEM) 461:1–21.
9. Koritsanszky TS, Volkov A, Chodkiewicz M (2012) Struct Bond 147:1–26.
10. Spartan’10 program; Wavefunction Inc.: Irvine, CA.
11. Frisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Montgomery JA Jr, Vreven T, Kudin KN, Burant JC, Millam JM, Iyengar SS, Tomasi J, Barone V, Mennucci B, Cossi M, Scalmani G, Rega N, Petersson GA, Nakatsuji H, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Klene M, Li X, Knox JE, Hratchian HP, Cross JB, Bakken V, Adamo C, Jaramillo J, Gomperts R, Stratmann RE, Yazyev O, Austin AJ, Cammi R, Pomelli C, Ochterski JW, Ayala PY, Morokuma K, Voth GA, Salvador P, Dannenberg JJ, Zakrzewski VG, Dapprich S, Daniels AD, Strain MC, Farkas O, Malick DK, Rabuck AD, Raghavachari K, Foresman JB, Ortiz JV, Cui Q, Baboul AG, Clifford S, Cioslowski J, Stefanov BB, Liu G, Liashenko A, Piskorz P, Komaromi I, Martin RL, Fox DJ, Keith TA, Al-Laham MA, Peng CY, Nanayakkara A, Challacombe M, Gill PMW, Johnson B, Chen W, Wong MW, Gonzalez C, Pople JA (2009) Gaussian 09, revision A; Gaussian Inc.: Wallingford, CT.
12. Becke AD (1993) J Chem Phys 98:5648–5652.
13. Lee C, Yang W, Parr RG (1988) Phys Rev B 37:785–789.
14. Hariharan PC, Pople JA (1974) Mol Phys 27:209–214.
15. Zhao Y, Schultz NE, Truhlar DG (2006) J Chem Theory Comput 2:364–382.
16. Hehre WJ, Radom L, Schleyer PvR, Pople JA (1986) Ab Initio Molecular Orbital Theory. Wiley, New York.
17. Keith TA (2012) AIMAll, Version 12.05.09, Gristmill Software, Overland Park, KS.
18. Volkov A, Koritsanszky TS, Chodkiewicz M, King HF (2009) J Comput Chem 30:1379–1391.
19. SDBSWeb: http://sdbs.db.aist.go.jp (National Institute of Advanced Industrial Science and Technology, Feb 02, 2016).
20. Kononenko I, Kukar M (2007) Machine learning and data mining: Introduction to principles and algorithms. Horwood Publishing.
21. Handley CM, Popelier PLA (2009) J Chem Theory Comput 5:1474–1489.
22. Rumelhart DE, Hinton GE, Williams RJ (1986) Nature 323:533–536.
23. Widrow B, Lehr MA (1990) Proc IEEE 78:1415–1442.
24. Terfloth L, Gasteiger J (2001) Drug Discovery Today 6:102–108.
25. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA Data Mining Software: An Update, SIGKDD Explorations 11:1.
26. Kruse H, Goerigk L, Grimme S (2012) J Org Chem 77:10824–10834.
27. Bader RFW, MacDougall PJ (1985) J Amer Chem Soc 107:6788–6795.
28. Varadwaj PR, Varadwaj A, Marques HM, MacDougall PJ (2019) Phys Chem Chem Phys 21:19969–19986.
29. Koritsanszky TS, Volkov A, Chodkiewicz M (2010) Structure and Bonding 147:1–25.
30. Thorell S, Schurmann M, Sprenger GA, Schneider G (2002) J Mol Biol 319:161–171. Protein Data Bank entry code 1L6W.
31. Luchi AM, Villafañe RN, Gómez-Chávez JL, Bogado ML, Angelina EL, Peruchena NM (2019) ACS Omega 4:19582–19594.
32. Massa L, Matta CF (2017) J Comput Chem 39:1021–1028.
33. Cademartiri L, Ozin GA (2009) Concepts of Nanochemistry, Wiley VCH, Germany.

Additional Declarations

No competing interests reported.
Supplementary Files

SupplementaryMaterials.doc
SupplementaryMaterialsBCPdata.xls
SupplementaryMaterialsLCPdata.xls

Editorial decision: Revision requested, 31 Jul, 2024
Reviews received at journal, 31 Jul, 2024
Reviewers agreed at journal, 31 Jul, 2024
Reviews received at journal, 23 Jul, 2024
Reviewers agreed at journal, 15 Jul, 2024
Reviewers agreed at journal, 11 Jul, 2024
Reviewers agreed at journal, 09 Jul, 2024
Reviewers invited by journal, 09 Jul, 2024
Editor assigned by journal, 09 Jul, 2024
Submission checks completed at journal, 08 Jul, 2024
First submitted to journal, 01 Jul, 2024

{"props":{"pageProps":{"initialData":{"identity":"rs-4669576","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":334258385,"identity":"7526ec1d-4c55-42ad-bc08-d758e87e66cc","order_by":0,"name":"Preston J. 
MacDougall","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA20lEQVRIiWNgGAWjYLACHhDBztz4AMo3IFILM2OzAUMCiVraJIjSYi52+OGHNxWH7fqbGduqbv6oS2xgb94mgU+L5ew0Y8k5Zw4nzzjM2HY7J+FwYgPPsTK8WgxuJ5gx87YdTmaAaDmQ2CCRY0ZAS/o3sBZ5oJbinASgw+TfENKSA7bFzgCohTkngRloCw9+LZazc4qBfklPMDzM2Cydk3bYuI0nrdgCnxZz6fSNwBCztpc73nzwc45NnWw/++GNN/A6DEonNsBE2PApR9ZiT0jhKBgFo2AUjGAAANKTSdvlXo8VAAAAAElFTkSuQmCC","orcid":"","institution":"Middle Tennessee State University","correspondingAuthor":true,"prefix":"","firstName":"Preston","middleName":"J.","lastName":"MacDougall","suffix":""},{"id":334258386,"identity":"edbe8152-cb34-4536-a5c7-246ff2682f37","order_by":1,"name":"Kiran K. Donthula","email":"","orcid":"","institution":"Middle Tennessee State University","correspondingAuthor":false,"prefix":"","firstName":"Kiran","middleName":"K.","lastName":"Donthula","suffix":""}],"badges":[],"createdAt":"2024-07-01 17:10:35","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-4669576/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-4669576/v1","draftVersion":[],"editorialEvents":[{"content":"https://doi.org/10.1007/s00214-024-03142-9","type":"published","date":"2024-09-26T15:58:05+00:00"}],"editorialNote":"","failedWorkflow":false,"files":[{"id":61587127,"identity":"c6bbdd13-0242-47a9-85b6-c30422957c62","added_by":"auto","created_at":"2024-08-01 14:33:16","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":47908,"visible":true,"origin":"","legend":"\u003cp\u003eMolecular graph of 2-methylbutanal. The green dots are (3,-1) BCPs. 
Only the topological properties of the three BCPs for bonds to the carbonyl carbon are used as training data here.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/38900c4b8812d81147920330.png"},{"id":61587134,"identity":"0a30849d-8add-41ed-a90b-3d2729bf53ee","added_by":"auto","created_at":"2024-08-01 14:33:17","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":233026,"visible":true,"origin":"","legend":"\u003cp\u003eThe zero-envelope of the Laplacian distribution for 2-methylbutanal. The value of ∇\u003csup\u003e2\u003c/sup\u003eρ = 0.0 au at every point of the surface shown. While there is only \u003cem\u003eone \u003c/em\u003etopological type of BCP, \u003cem\u003etwo\u003c/em\u003e of the four topological types of LCPs were used as training data. The blue dots are (3,+3) LCPs, and correspond to bonded or nonbonded charge concentrations in the VSCCs of the respective atoms. The red dots are LCPs that are saddle points. 
Only the topological properties of the three bonded charge concentrations in the VSCC of the carbonyl carbon, as well as those of the two (3,-1) LCPs in the VSCC of the carbonyl carbon that correspond to the π-holes, are used as training data here.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/bd4805180bc431ef2441d8a9.png"},{"id":61587131,"identity":"51b62e87-79ec-446a-86e7-7e6e10604827","added_by":"auto","created_at":"2024-08-01 14:33:16","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":66656,"visible":true,"origin":"","legend":"\u003cp\u003eStarting geometry and interaction energies between a nucleophilic fluoride ion and the carbonyl group of acetone, with carbon-fluorine distance fixed at values between 1 and 3.5 Angstroms, and all other parameters optimized.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/00514a5e77515048ef52dd29.png"},{"id":61587128,"identity":"0a068663-7879-4bfc-ad88-dcda578eb603","added_by":"auto","created_at":"2024-08-01 14:33:16","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":121449,"visible":true,"origin":"","legend":"\u003cp\u003e(Top) The cluster used to model the FSA active site, with glycol substrate removed from the active site pocket. Residues 1, 2, 3, 4, and 5 are Ala165, Tyr131*, Lys85, Asn28, and Asp6, respectively.\u0026nbsp; Broken peptide bonds are capped with hydrogens, and non-hydrogen nuclear coordinates are taken from the protein crystal structure [30]. (Bottom) The ∇\u003csup\u003e2\u003c/sup\u003eρ = 0.0 isosurface for the Laplacian of the charge density computed for the quantum chemical model of the FSA active site cluster above, with 3-hydroxypropanal added, and optimized, in place of glycerol. 
*In this FSA active site model, the ring of the tyrosine residue was mistakenly saturated instead of being aromatic.\u0026nbsp; As this was done consistently in all calculations and ANN training, but not noticed until a referee comment, the effect on this proof-of-principle stress-test is immaterial.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/208fbd89e939410d35bd75f6.png"},{"id":61587535,"identity":"f01c84a6-7ccb-4ab6-911a-592893577a15","added_by":"auto","created_at":"2024-08-01 14:41:16","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":161392,"visible":true,"origin":"","legend":"\u003cp\u003e(Top) A contour diagram of the Laplacian of the charge density for the π-plane of the carbonyl group of 3-hydroxypropanal (3HP) within the binding pocket of FSA enzyme. \u0026nbsp;(Bottom) A contour diagram of the Laplacian of the charge density AFTER a fluoride ion has formed a polar covalent bond with the carbonyl carbon of 3HP within the binding pocket of the FSA enzyme. In both plots, the dashed (solid) contour lines denote regions of charge concentration (depletion). 
The more-or-less linear lines are atomic interaction lines; with solid paths indicating covalent bonds, and dashed paths indicating van der Waals interactions.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/d876c162e1b3093aca9d077e.png"},{"id":65627376,"identity":"ffe5fe04-1070-421c-ba0a-18b49b709097","added_by":"auto","created_at":"2024-09-30 16:15:34","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1116314,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/23678f2d-818a-4512-9c97-1d41f970faf8.pdf"},{"id":61587133,"identity":"726edeb6-c462-4a21-a0c6-d9714c24026f","added_by":"auto","created_at":"2024-08-01 14:33:16","extension":"doc","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":1807360,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterials.doc","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/e12a3caf1cde2f3be5602f3e.doc"},{"id":61587536,"identity":"ebe96229-6477-40d2-af80-7d5dbc2556bb","added_by":"auto","created_at":"2024-08-01 14:41:16","extension":"xls","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":114688,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterialsBCPdata.xls","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/55f8e86f5d618a9e41138586.xls"},{"id":61587537,"identity":"cd3b35c5-c078-46c6-88ad-af4c2e25e363","added_by":"auto","created_at":"2024-08-01 
14:41:16","extension":"xls","order_by":3,"title":"","display":"","copyAsset":false,"role":"supplement","size":166912,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterialsLCPdata.xls","url":"https://assets-eu.researchsquare.com/files/rs-4669576/v1/79be3d9872a0a3a4e74d4df0.xls"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eUsing Quantum Atomics and Machine Learning to Advance Picotechnology\u003c/p\u003e","fulltext":[{"header":"WHAT IS PICOTECHNOLOGY?","content":"\u003cp\u003eNanotechnology is the scientific investigation of matter at the nanometer scale\u0026nbsp;(1 nm = 10\u003csup\u003e-9\u003c/sup\u003e m) \u0026nbsp;\u0026ndash; roughly from one to hundreds of nanometers \u0026ndash; and engineering systems that take advantage of the properties of matter at that scale. \u0026nbsp;Drug-like organic molecules have dimensions measuring up to just a few nanometers. \u0026nbsp;So, dating back to the mid-1800s, organic chemistry is the oldest form of nanotechnology. \u0026nbsp;\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eSimilarly, but a century or so younger, femtotechnology corresponds to the scientific investigation of matter at the femtometer scale (1 fm = 10\u003csup\u003e-15\u003c/sup\u003e m) and engineering systems that take advantage of the properties of matter at that scale. \u0026nbsp;It is better known as nuclear engineering.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eAny future picotechnology should therefore be defined as the scientific investigation of matter at the picometer scale (1 pm = 10\u003csup\u003e-12\u003c/sup\u003e m) \u0026ndash; from one to hundreds of picometers - and engineering systems that take advantage of the properties of matter at that scale. 
\u0026nbsp;In other words, \u003cem\u003eatomic engineering\u003c/em\u003e!\u003c/p\u003e\n\u003cp\u003eSince most atoms have diameters over 100 pm, picotechnology is necessarily concerned with subatomic features (outside the nucleus). \u0026nbsp; Chemists routinely refer to such features conceptually, such as when using arrow-pushing diagrams to show a nonbonding electron pair on one atom filling an unoccupied atomic orbital on another atom, forming a new covalent bond. \u0026nbsp;Technologies, however, are not purely conceptual. \u0026nbsp;They also have physical manifestations. \u0026nbsp;Physical, reproducible, and chemically significant subatomic features have been identified in the topology of the Laplacian of the charge density,\u0026nbsp;∇\u003csup\u003e2\u003c/sup\u003eρ, for atoms from all across the periodic table, and in many forms of matter, by computational and experimental researchers alike [1-4].\u003c/p\u003e\n\u003cp\u003ePioneering chemical informatics work by Alsberg \u003cem\u003eet al.\u003c/em\u003e noted that \u0026ldquo;The (QTAIM) approach is well suited for machine inference as it is topologically based and close to the na\u0026iuml;ve atom-bond current ILP (inductive logic programming) representation.\u0026rdquo; [5]. \u0026nbsp; Concurrently, Popelier \u003cem\u003eet al.\u003c/em\u003e began developing Quantum Topological Molecular Similarity (QTMS) with charge density topological data (at bond critical points) to predict a variety of empirical chemical parameters, such as Hammett constants for benzoic acid derivatives [6]. 
These and similar studies motivated us to consider whether other charge density descriptors could prove useful in training neural networks to predict more localized and directional chemical properties, such as mapping van der Waals interaction energies, which are key to drug design and modelling biomolecular interactions.\u003c/p\u003e\n\u003cp\u003eFor our neural network training data, in addition to the bond critical point data that others have used, we also include topological features in\u0026nbsp;∇\u003csup\u003e2\u003c/sup\u003eρ, specifically within the valence-shell charge concentration (VSCC) of carbon atoms in carbonyl groups (Fig. 2). Such subatomic topological features have previously been shown to predict B\u0026uuml;rgi-Dunitz angles of nucleophilic attack [7], and even changes to that angle upon protonation of the carbonyl oxygen [1]. \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eHowever, like all other wavefunction-based quantum chemical modelling of chemical reactivity, the topological analysis of localized sites of attack first required time-consuming Coulomb and exchange-correlation integrals spanning the entire molecule. Even using DFT, which omits the latter and most time-consuming integrals, if the carbonyl group of interest is part of a \u003cem\u003every large\u003c/em\u003e protein complex, then most computation time is wasted on computing the charge density around atoms unrelated to the reactive site. 
\u0026nbsp;QM/MM embedding methods such as ONIOM [8] can further economize by focusing high-level computing effort on the reactive site, but that is not the objective of the current work.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u0026nbsp;\u003cem\u003eOur goal is to investigate the feasibility of training neural networks to predict carbonyl reactivity using descriptors based on topological properties of subatomic features within the VSCC of carbonyl carbon atoms (measured in picometers) \u0026ndash; regardless of how the charge density is computed.\u003c/em\u003e\u0026nbsp; It could, in fact, be measured with high-resolution, single-crystal X-ray diffraction [4].\u003c/p\u003e\n\u003cp\u003eUltimately, because the charge densities of functional groups can be very transferable between molecules [2] (even when they differ greatly on the periphery), we expect that in the future picotechnology that we envision, the necessary subatomic topological descriptors will be very rapidly and accurately constructed from fragment databases [9]. \u0026nbsp;Thus, when picotechnology is mature, \u003cem\u003eno quantum chemical calculations or high-resolution X-ray diffraction experiments will be required to predict chemical reactivity at the reactive site of a molecule.\u0026nbsp;\u003c/em\u003e\u0026nbsp;\u003c/p\u003e"},{"header":"COMPUTATIONAL METHODS ","content":"\u003ch2\u003e2.1 Quantum Chemistry\u003c/h2\u003e\n\u003cp\u003eThe initial Cartesian coordinates of 225 carbonyl compounds, including aldehydes, ketones, and amides, were obtained using the molecular modelling software Spartan’10 [10]. The molecules used in this study range in size from 7 to 31 atoms. A complete list of compounds included in the training set can be found in Supplementary Materials (Table 1). Full geometry optimization (of default Spartan-generated conformers) and generation of the wavefunction files were carried out using Gaussian09 [11]. 
All molecules were optimized at the B3LYP/6-31+G* level, and wavefunctions were obtained at the M05-2X/6-311++G** level of theory [12-16]. The wavefunctions obtained from these calculations were used to calculate the topological properties of the molecules, such as the electron density, the Laplacian of the electron density, Hessian eigenvalues (of the Hessian matrix of ρ or ∇\u003csup\u003e2\u003c/sup\u003eρ), bond critical points, and critical points in ∇\u003csup\u003e2\u003c/sup\u003eρ. \u0026nbsp;The topological properties of the electron density at bond critical points were determined using AIMAll [17], whereas the topological properties at critical points of the Laplacian of the charge density were determined using Denprop [18]. The topological data used for descriptors in neural network training were extracted using a Python script to create the data sets. \u0026nbsp;The types of critical points that are located and used as training data in this research are illustrated in Figures 1 and 2.\u003c/p\u003e\n\u003ch2\u003e2.2 Data Sets and Quantum Atomics\u003c/h2\u003e\n\u003cp\u003e\u003cem\u003eQuantum Atomics\u003c/em\u003e refers to topological and/or integrated atomic data obtained via, or within the context of, the Quantum Theory of Atoms in Molecules (QTAIM). \u0026nbsp;Numerous computational and experimental groups worldwide have been collecting such data for decades, without referring to it as such. \u0026nbsp;It is perhaps a useful term for distinguishing such data from other computational chemistry and crystallographic parameters that are routinely used, but lack the rigorous foundation of QTAIM. \u0026nbsp;A second over-arching goal of this research is to explore the information content “quality” of various types of Quantum Atomic data for specific machine learning goals. \u0026nbsp;For instance, if the goal is to predict chemical reactivity, should training data include topological properties of the charge density, or the Laplacian of the charge density? 
\u0026nbsp;Although we have not explored such questions here, future research could address whether more efficient and accurate machine learning can be achieved with specific \u003cem\u003eintegrated\u003c/em\u003e atomic properties that are defined uniquely within QTAIM, such as atomic energy, magnetic susceptibility, electric polarizability, multipole moments, etc.\u003c/p\u003e\n\u003cp\u003eIn this research, a total of nine separate data sets were used to train artificial neural networks (ANNs): three types of topological data (bond critical point (BCP) data, Laplacian critical point (LCP) data, and combined BCP and LCP data), each paired with three types of physicochemical data (C-13 chemical shifts, C=O stretching frequencies, and interaction energies with a model nucleophile, the fluoride ion). Each data set contains critical point descriptors for all 225 molecules, together with a class label: an experimental C-13 shift, an experimental C=O stretching frequency, or a theoretical interaction energy value. The experimental C-13 chemical shifts (in ppm) and C=O stretching frequencies (in cm\u003csup\u003e-1\u003c/sup\u003e) were collected from the Spectral Database for Organic Compounds (SDBS) [19]. \u0026nbsp;All topological data in the training sets can be found in Supplementary Materials (Table 2). \u0026nbsp; An example of the LCP portion of such a data set for 1-butanal is shown in Table 1.\u003c/p\u003e\n\u003cp\u003eTable 1. 
Sample Laplacian critical point (LCP) input data for 2-chloro-4-fluorobenzaldehyde (all units au)\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"565\"\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"top\"\u003e\n \u003cp\u003e\u0026nbsp; \u0026nbsp; Distance \u0026nbsp; \u0026nbsp; from C nucleus\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"top\"\u003e\n \u003cp\u003eλ\u003csub\u003e1\u003c/sub\u003e\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"top\"\u003e\n \u003cp\u003eλ\u003csub\u003e2\u003c/sub\u003e\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"top\"\u003e\n \u003cp\u003eλ\u003csub\u003e3\u003c/sub\u003e\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"top\"\u003e\n \u003cp\u003eρ\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"top\"\u003e\n \u003cp\u003e∇\u003csup\u003e2\u003c/sup\u003eρ\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"bottom\"\u003e\n \u003cp\u003e1.0282\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-1.140\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-0.614\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e10.800\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.141\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.0294\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"bottom\"\u003e\n \u003cp\u003e1.0282\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" 
valign=\"bottom\"\u003e\n \u003cp\u003e-1.140\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-0.614\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e10.800\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.141\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.0294\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.9701\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e4.300\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e5.060\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e22.000\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.292\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-1.0100\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.9790\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e6.480\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e6.900\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e18.800\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.302\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-1.2300\u003c/p\u003e\n 
\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"16.07773851590106%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.9811\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e5.230\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e8.960\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e18.000\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e0.431\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"16.784452296819786%\" valign=\"bottom\"\u003e\n \u003cp\u003e-1.0700\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\n\u003cp\u003eThe interaction energy (Δ\u003cem\u003eE\u003c/em\u003e\u003csub\u003einteraction\u003c/sub\u003e) between carbonyl compounds and a fluoride ion (F\u003csup\u003e-1\u003c/sup\u003e) in the nucleophilic addition reaction was calculated at B3LYP/6-31+G* level with the following approach, \u0026nbsp;\u003c/p\u003e\n\u003cp\u003eΔ\u003cem\u003eE\u003c/em\u003e\u003csub\u003einteraction\u0026nbsp;\u003c/sub\u003e= \u003cem\u003eE\u003c/em\u003e\u003csub\u003eCC+F\u0026nbsp;\u003c/sub\u003e- \u003cem\u003eE\u003c/em\u003e\u003csub\u003eCC\u0026nbsp;\u003c/sub\u003e- \u003cem\u003eE\u003c/em\u003e\u003csub\u003eF\u003c/sub\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; (Eqn. 
1)\u003c/p\u003e\n\u003cp\u003ewhere \u003cem\u003eE\u003c/em\u003e\u003csub\u003eCC+F\u003c/sub\u003e, \u003cem\u003eE\u003c/em\u003e\u003csub\u003eCC\u003c/sub\u003e, and \u003cem\u003eE\u003c/em\u003e\u003csub\u003eF\u003c/sub\u003e are the total energies of the carbanion complex, the carbonyl compound, and the fluoride ion nucleophile, respectively. \u0026nbsp;Basis set superposition errors (BSSE) were not calculated, as they would have considerably increased computational time, while yielding no significant value. \u0026nbsp;The machine learning here is based on properties of the electron density, which are completely unaffected by BSSE.\u003c/p\u003e\n\u003ch2\u003e2.3 Artificial Neural Network Model\u003c/h2\u003e\n\u003cp\u003eOver the past few decades, artificial neural networks (ANNs) have had huge success in machine learning and data mining applications [20]. \u0026nbsp;Recently, Handley and Popelier have pioneered the use of QTAIM data (atomic multipole moments) in conjunction with machine learning to model the fluctuating polarizability of water molecules in molecular dynamics simulations [21]. \u0026nbsp;ANNs are powerful tools in advanced computing that analyse information quantitatively by learning from training data. \u0026nbsp;The important properties of ANNs are the ability of a network to learn from its environment and to improve its performance through learning. A learning algorithm is a procedure in which learning rules are used to adjust the weights. An ANN consists of an input layer, one or more hidden layers, and an output layer. The input signal propagates layer by layer in the forward direction, and such networks are commonly called multilayer perceptrons (MLPs) [22,23]. 
\u0026nbsp;The MLP with the back-propagation (BP) learning method has been used successfully in chemistry and drug design because of its well-defined and explicit set of equations for weight corrections [24]. \u0026nbsp;This is a supervised learning algorithm, in which the network is trained on data for which the expected outputs are provided. The learning consists of a forward pass and a backward pass. In the forward pass, the input vector is applied to the input layer, the input values are modified by fixed weights, and the effect passes through the network layer by layer. Finally, the output produced by the network is compared with the desired output to calculate the error signal. This error signal is then back-propagated in the backward pass to adjust the weights in such a way that the actual output value moves closer to the desired output value, according to the error correction rule [22].\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eIn this study, the machine learning package WEKA 3.6.13 was used for the ANN model development [25]. In our model, we used one hidden layer with 9 neurons when training with bond critical point data, and one hidden layer with 15 neurons when training with Laplacian critical point data. The predictive ability of the network is determined by validation techniques. We used the leave-one-out cross-validation technique, in which the whole data set is divided into 225 pieces, with 224 pieces used for training and one piece for testing, repeated so that each piece serves once as the test case. The mean absolute percent errors were calculated by comparing predicted values with actual values.\u0026nbsp;\u003c/p\u003e"},{"header":"Proof-of-concept Spectroscopic Predictions","content":"\u003cp\u003eAs a proof-of-concept exercise, we first attempted to predict routinely measured infrared stretching frequencies of the carbonyl bonds and the \u003csup\u003e13\u003c/sup\u003eC chemical shift of the carbonyl carbons. 
\u0026nbsp;Such properties can be accurately approximated by semi-empirical wavefunction-based methods, but their prediction with spectroscopic precision using true \u003cem\u003eab initio\u003c/em\u003e methods is still quite challenging. \u0026nbsp; Development of alternative and highly efficient methodologies for predicting even these properties is still of interest.\u003c/p\u003e\u003cp\u003eTable 2 presents a summary of results using optimized machine learning parameters within WEKA. \u0026nbsp; Default parameters yield results of significantly lower quality. \u0026nbsp;While nowhere close to spectroscopic accuracy, the results are promising considering that data for \u003cem\u003eonly\u003c/em\u003e 3 BCPs and 5 LCPs were used for each compound. \u0026nbsp;Predictions for individual carbonyl types, using default and optimized parameters (learning rate, number of epochs and hidden neurons), as well as the optimized ANN parameters themselves, can be found in Supplementary Materials (Tables 3, 4, and 5, respectively). \u0026nbsp;For instance, the Mean Absolute Percentage Error (MAPE) in predicting C=O stretching frequency for the 100+ \u003cem\u003eketones only\u003c/em\u003e was 0.41% (better than results for all carbonyl types, Table 2).\u003c/p\u003e\u003cp\u003eIt is interesting to note that, as expected, the combined (BCP and LCP) data set yields better predictions, on average. \u0026nbsp;And while there is no significant difference between the BCP and LCP data sets in predicting C=O stretching frequencies, the topological properties of the Laplacian of the charge density appear to be “better” training data for predicting chemical shifts. \u0026nbsp;This, too, may not be unexpected since the NMR chemical shift of a nucleus depends on shielding over the entire volume of the carbon atom, not just in the σ-plane, where all the BCPs are located.\u003c/p\u003e\u003cp\u003eTable 2. 
MAPE of predicted values of C-13 chemical shifts and C=O stretching frequencies for 225 carbonyl compounds\u003c/p\u003e\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd width=\"31.191222570532915%\" valign=\"top\"\u003e\n \u003cp\u003eData set\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.42319749216301%\" valign=\"top\"\u003e\n \u003cp\u003eC-13 chemical shift (%)\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.38557993730407%\" valign=\"top\"\u003e\n \u003cp\u003eC=O stretching frequency (%)\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.191222570532915%\" valign=\"top\"\u003e\n \u003cp\u003eLaplacian critical point data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.42319749216301%\" valign=\"top\"\u003e\n \u003cp\u003e1.335\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.38557993730407%\" valign=\"top\"\u003e\n \u003cp\u003e0.659\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.191222570532915%\" valign=\"top\"\u003e\n \u003cp\u003eBond critical point data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.42319749216301%\" valign=\"top\"\u003e\n \u003cp\u003e1.596\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.38557993730407%\" valign=\"top\"\u003e\n \u003cp\u003e0.652\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.191222570532915%\" valign=\"top\"\u003e\n \u003cp\u003eCombined data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.42319749216301%\" valign=\"top\"\u003e\n \u003cp\u003e1.308\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.38557993730407%\" valign=\"top\"\u003e\n \u003cp\u003e0.641\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e"},{"header":"Nucleophilic Interaction Energy Predictions","content":"\u003cp\u003eHere, we report the prediction of both covalent and van der Waals interaction energies using ANNs trained on ρ and/or 
∇\u003csup\u003e2\u003c/sup\u003eρ critical point descriptors. A nucleophilic addition reaction between a fluoride ion and a carbonyl group was taken as a model chemical reaction for our investigation of machine learning of chemical reactivity. \u0026nbsp;As shown in Figure 3, preliminary investigations of reaction energy profiles, at multiple levels of theory, between a fluoride ion and acetone revealed dual minima: one at a short C-F distance (“covalent” bond formation), and one at a long C-F distance (van der Waals interaction). \u0026nbsp;The interaction energies (Eqn. 1) were calculated for both strong (covalent bond formation) and weak (van der Waals) interactions for our set of 225 carbonyl-containing molecules. \u0026nbsp; Our ANN was then trained on these data, using BCP, LCP, and combined data sets, as before. \u0026nbsp; A summary of Mean Absolute Error (MAE) results for leave-one-out cross-validation machine learning predictions of interaction energies is shown in Table 3.\u003c/p\u003e\u003cp\u003eTable 3. 
\u0026nbsp;MAE of predicted interaction energies with optimum parameters for training the ANN model\u003c/p\u003e\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd width=\"31.26022913256956%\" valign=\"top\"\u003e\n \u003cp\u003eData set\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.3518821603928%\" valign=\"top\"\u003e\n \u003cp\u003ecovalent interaction energy (kcal/mol)\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.387888707037646%\" valign=\"top\"\u003e\n \u003cp\u003evan der Waals interaction energy\u0026nbsp;(kcal/mol)\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.26022913256956%\" valign=\"top\"\u003e\n \u003cp\u003eLaplacian critical point data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.3518821603928%\" valign=\"top\"\u003e\n \u003cp\u003e2.63\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.387888707037646%\" valign=\"top\"\u003e\n \u003cp\u003e5.06\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.26022913256956%\" valign=\"top\"\u003e\n \u003cp\u003eBond critical point data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.3518821603928%\" valign=\"top\"\u003e\n \u003cp\u003e3.39\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.387888707037646%\" valign=\"top\"\u003e\n \u003cp\u003e4.80\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.26022913256956%\" valign=\"top\"\u003e\n \u003cp\u003eCombined data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"35.3518821603928%\" valign=\"top\"\u003e\n \u003cp\u003e2.56\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"33.387888707037646%\" valign=\"top\"\u003e\n \u003cp\u003e4.78\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003cp\u003eThe ANN model yields much better predictions for covalent, as opposed to van der Waals interaction energies. 
\u0026nbsp; This is an unsurprising result, since machine learning is highly dependent on the quality of the training data, and the B3LYP functional is well-known to poorly reproduce the dispersion energy, which is critical to van der Waals interactions [26]. \u0026nbsp;Since DFT has become a standard workhorse in molecular modelling, future research in machine learning based on charge density descriptors that have been derived from more generally reliable functionals is warranted. \u0026nbsp;With regard to the covalent interaction energy predictions of the optimized ANN model, MAEs are above the desired chemical accuracy that can be achieved easily with inexpensive empirical methods, but they support the idea that there is very high-quality information content in just a small number of charge density descriptors. \u0026nbsp;We note in particular that LCP data is significantly better at predicting covalent interaction energies than the commonly used BCP data. \u0026nbsp;This should not be surprising, because although the topological properties of the charge density led to the development of QTAIM in the first place, it is the topological properties of the Laplacian of the charge density that have been more useful in subsequent [27], as well as ongoing, studies of chemical reactivity based on the charge density [28]. 
\u0026nbsp;This is perhaps the most salient observation from the current investigation: \u003cstrong\u003e\u003cem\u003eNot only is the topology of the Laplacian of the charge density more complex than that of the electron density (many more critical points), but the information content at these critical points is higher when the objective is to predict chemical reactivity.\u0026nbsp;\u003c/em\u003e\u003c/strong\u003ePredictions for individual carbonyl types, using default and optimized parameters (learning rate, number of epochs and hidden neurons), as well as the optimized ANN parameters themselves, can be found in Supplementary Materials (Tables 6, 7, and 8, respectively).\u003c/p\u003e"},{"header":"stress-test: predicting interaction energies of substrates in an enzyme’s active site","content":"\u003cp\u003eThe accuracy of our ANN model predictions of the spectroscopic properties and chemical reactivity of the 225 carbonyl-containing compounds studied in this research is not competitive with standard quantum chemical or even semi-empirical methods. \u0026nbsp;And our ANN model is based on training data that is derived \u003cem\u003efrom\u003c/em\u003e the aforementioned quantum chemical methods. \u0026nbsp; However, if developed further, the type of ANN model that we have explored \u003cem\u003ewill\u003c/em\u003e become competitive when the chemical systems become too big to do rapid conventional computational chemistry \u003cem\u003eof any kind.\u003c/em\u003e\u0026nbsp; Our ANN model depends \u003cem\u003eonly\u0026nbsp;\u003c/em\u003eon topological properties of the charge density, which is measurable, and only at the site of interaction. 
\u0026nbsp;While it is necessary, currently, to have very high-resolution single-crystal X-ray diffraction data [4] to determine the topological properties of the charge density that our ANN model is trained with, progress is being made in Quantum Crystallographic databases [29] that would enable \u003cem\u003every rapid\u003c/em\u003e approximation of the topological properties required to train or apply our ANN model, and for a molecule of \u003cem\u003eany size\u003c/em\u003e, as long as the nuclear coordinates are known, either by low-resolution structure determination methods, or by rudimentary classical molecular mechanics.\u003c/p\u003e\u003cp\u003eWith a leap of faith into this hopeful future, we tested the accuracy of our small molecule-trained ANN model in a stress-test involving nucleophilic addition to a natural carbonyl-containing substrate of the \u003cem\u003eE. coli\u0026nbsp;\u003c/em\u003eenzyme, D-fructose-6-phosphate aldolase (FSA), which catalyzes such a nucleophilic addition, and which has a known molecular structure with a bound substrate (glycerol) [30]. \u0026nbsp;We replaced glycerol with 3-hydroxypropanal (3HP), to which nucleophilic addition is catalyzed naturally by FSA, retaining hydrogen bonding contacts between the hydroxyl group and residues Asn28 and Asp6 (Residues 4 and 5, respectively, in Fig. 4). \u0026nbsp;We then optimized the substrate in a fixed binding pocket. \u0026nbsp;The ∇\u003csup\u003e2\u003c/sup\u003eρ = 0.0 isosurface for the Laplacian of the charge density for 3HP in the binding pocket is shown in Figure 4. \u0026nbsp;Many π-holes are evident, such as those in the VSCC of carbons in the aromatic ring of tyrosine (Residue 2 in Fig. 4), which measure about 20 picometers in diameter, or less, if there is a π-hole at all. \u0026nbsp;As expected, the largest π-hole is on the carbonyl carbon of 3HP (near the center of the cluster in Figure 4, bottom), which, after all, is the target of the nucleophilic attack. 
\u0026nbsp;Its π-hole measures approximately 50 picometers in diameter, and is also shown in a cross-sectional view in Figure 5, both before and after covalent bond formation. \u0026nbsp;These well-defined, observable, and often transferable subatomic features, which are made evident by the Laplacian of the charge density, are clearly related to the chemical properties of the respective atoms. \u0026nbsp;Data associated with these features, such as their sizes, locations, and topological properties, all fall under the label of Quantum Atomics. \u0026nbsp;Their size, in particular, also serves to illustrate and justify at least the prefix of the term picotechnology.\u003c/p\u003e\u003cp\u003eEqn. 2 was used to calculate interaction energies quantum mechanically, at the same level of theory as before. \u0026nbsp;In addition to 3HP, the fluoride nucleophile was optimized alone in the fixed pocket, as were the combined reactants. \u0026nbsp;All charge density analysis was done as before (with the 225 carbonyl-containing compounds). \u0026nbsp;Only a covalent interaction energy was determined, as there was not enough room in the pocket for a van der Waals complex to form between the fluoride ion and 3HP. \u0026nbsp;The same ANN model that was optimized for the 225 carbonyl-containing compounds was used to predict the covalent interaction energy, using both the BCP and LCP data, as well as the combined data set. \u0026nbsp;The results of our quantum chemical calculations and the ANN model predictions are summarized in Table 4. 
\u0026nbsp;\u0026nbsp;\u003c/p\u003e\u003cp\u003eΔ\u003cem\u003eE\u003c/em\u003e\u003csub\u003einteraction\u0026nbsp;\u003c/sub\u003e= \u003cem\u003eE\u003c/em\u003e\u003csub\u003epocket+reactants(CC+F)\u0026nbsp;\u003c/sub\u003e– \u003cem\u003eE\u003c/em\u003e\u003csub\u003epocket+CC\u0026nbsp;\u003c/sub\u003e– \u003cem\u003eE\u003c/em\u003e\u003csub\u003epocket+F\u003c/sub\u003e + \u003cem\u003eE\u003c/em\u003e\u003csub\u003epocket\u003c/sub\u003e\u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp; \u0026nbsp;(Eqn. 2)\u003c/p\u003e\u003cp\u003eTable 4. \u0026nbsp; Comparison of covalent Δ\u003cem\u003eE\u003c/em\u003e\u003csub\u003einteraction\u003c/sub\u003e in the binding pocket of the FSA enzyme (kcal/mol)\u003c/p\u003e\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\"\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd width=\"31.645569620253166%\" valign=\"top\"\u003e\n \u003cp\u003eDFT\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"24.050632911392405%\" valign=\"top\"\u003e\n \u003cp\u003eANN model\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"21.518987341772153%\" valign=\"top\"\u003e\n \u003cp\u003eANN model\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"22.78481012658228%\" valign=\"top\"\u003e\n \u003cp\u003eANN model\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.645569620253166%\" valign=\"top\"\u003e\n \u003cp\u003eB3LYP/6-31+G*\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"24.050632911392405%\" valign=\"top\"\u003e\n \u003cp\u003eBCP data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"21.518987341772153%\" valign=\"top\"\u003e\n \u003cp\u003eLCP data\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"22.78481012658228%\" valign=\"top\"\u003e\n \u003cp\u003eCombined data\u003c/p\u003e\n 
\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd width=\"31.645569620253166%\" valign=\"top\"\u003e\n \u003cp\u003e-22.7\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"24.050632911392405%\" valign=\"top\"\u003e\n \u003cp\u003e-11.9\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"21.518987341772153%\" valign=\"top\"\u003e\n \u003cp\u003e-19.7\u003c/p\u003e\n \u003c/td\u003e\u003ctd width=\"22.78481012658228%\" valign=\"top\"\u003e\n \u003cp\u003e-14.9\u003c/p\u003e\n \u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/table\u003e\u003cp\u003eAs in the case of using the ANN model to predict interaction energies of the 225 carbonyl-containing compounds, training data based on the topological properties of the Laplacian of the charge density led to more accurate predictions of chemical reactivity. \u0026nbsp;We note that when LCP data is used, the 3.0 kcal/mol absolute error in the prediction of the covalent interaction energy of this large and complex enzyme model is about the same as the Mean Absolute Error in covalent interaction energies for smaller and simpler carbonyl compounds, \u003cstrong\u003e\u003cem\u003eeven though no additional training was done.\u003c/em\u003e\u003c/strong\u003e\u0026nbsp; \u0026nbsp;A very recent study combining charge density analysis with machine learning sought to investigate the inhibition mechanism of cruzain, a cysteine protease that is key to the life cycle of the parasite responsible for Chagas disease [31]. \u0026nbsp;Their charge density-based descriptors were limited to BCP training data. \u0026nbsp;Our research indicates that such studies could be greatly improved by including Laplacian-based descriptors among the training data.\u003c/p\u003e"},{"header":"CONCLUSIONS AND OUTLOOK","content":"\u003cp\u003eThe rapid advancement of machine learning technology is matched by the growing range of applications to which it is being applied. 
\u0026nbsp; But like any other advanced technology, it is not a one-size-fits-all tool. \u0026nbsp;We already know that the type and quality of training data are as important as the method by which training occurs. \u0026nbsp;The research presented here reinforces this. \u0026nbsp;Additionally, the converse of this principle is that machine learning can also be used as a tool to empirically explore the information content of different properties of matter that are accessible but not yet fully understood. \u0026nbsp; When it comes to machine learning applications in materials science and engineering, nothing that is measurable is more fundamental than the charge density. \u0026nbsp;If we completely understood the forces shaping the charge distribution in matter, functional development in Density Functional Theory would be a straightforward matter. \u0026nbsp;It is not. The research presented here has offered insight into the types of applications for which different properties of the charge density are better suited as training data. \u0026nbsp;Simultaneous with advancements in machine learning, developments in both the theory and practice of Quantum Crystallography [32] make measurement of the charge density and its derivative properties ever more accessible and accurate. \u0026nbsp;Advancement of nanotechnology has ultimately been dependent on engineering molecular structure and reactivity [33], which chemists achieve via substitutions of one type or another - at the atomic or functional group level. \u0026nbsp;Again, this is not straightforward, because there is incomplete understanding of \u003cstrong\u003e\u003cem\u003ehow\u0026nbsp;\u003c/em\u003e\u003c/strong\u003eatomic substitutions determine changes in molecular properties. 
\u0026nbsp;And that is the long-term goal of this research: to advance a \u003cstrong\u003e\u003cem\u003epicotechnology\u0026nbsp;\u003c/em\u003e\u003c/strong\u003eby combining the data-mining and property-prediction power of machine learning with the ever-increasing understanding of the charge distribution - \u003cstrong\u003e\u003cem\u003ewith\u003c/em\u003e\u003c/strong\u003e \u003cstrong\u003e\u003cem\u003esubatomic resolution\u003c/em\u003e\u003c/strong\u003e.\u0026nbsp;\u003c/p\u003e"},{"header":"Declarations","content":"\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003ePJM wrote the main manuscript text and KKD performed the calculations and prepared figures 1-5. All authors reviewed the manuscript.\u003c/p\u003e\u003ch2\u003eAcknowledgement\u003c/h2\u003e\u003cp\u003eFinancial support is acknowledged from the Office of Science, U.S. Department of Energy (DE-SC0005094).\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eData is provided within the manuscript or supplementary information files.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eBader RFW, MacDougall PJ, Lau CDH (1984) J Amer Chem Soc 106:1594\u0026ndash;1605.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBader RFW (1990) Atoms in Molecules: A Quantum Theory. Clarendon Press, Oxford.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMacDougall PJ, Henze CE (2007) In: Matta CF, Boyd RJ (eds) The Quantum Theory of Atoms in Molecules: From Solid State to DNA and Drug Design. 
Wiley-VCH, Weinheim.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCoppens P, Koritsanszky T (2001) Chem Rev 101:1583\u0026ndash;1627.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKing RD, Marchand-Geneste N, Alsberg BK (2001) Electronic Transactions on Artificial Intelligence 5B:127\u0026ndash;142.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePopelier PLA, O\u0026rsquo;Brien SE (2001) J Chem Inf Comput Sci 41:764\u0026ndash;775. Popelier PLA, Smith PJ (2006) Eur J Med Chem 41:862\u0026ndash;873.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eB\u0026uuml;rgi HB, Dunitz JD, Lehn JM, Wipff G (1974) Tetrahedron 30:1563\u0026ndash;1572.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDapprich S, Komaromi I, Byun KS, Morokuma K, Frisch MJ (1999) J Mol Struct (THEOCHEM) 461:1\u0026ndash;21.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoritsanszky TS, Volkov A, Chodkiewicz M (2012) Struct Bond 147:1\u0026ndash;26.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSpartan\u0026rsquo;10 program; Wavefunction Inc.: Irvine, CA.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFrisch MJ, Trucks GW, Schlegel HB, Scuseria GE, Robb MA, Cheeseman JR, Montgomery JA Jr, Vreven T, Kudin KN, Burant JC, Millam JM, Iyengar SS, Tomasi J, Barone V, Mennucci B, Cossi M, Scalmani G, Rega N, Petersson GA, Nakatsuji H, Hada M, Ehara M, Toyota K, Fukuda R, Hasegawa J, Ishida M, Nakajima T, Honda Y, Kitao O, Nakai H, Klene M, Li X, Knox JE, Hratchian HP, Cross JB, Bakken V, Adamo C, Jaramillo J, Gomperts R, Stratmann RE, Yazyev O, Austin AJ, Cammi R, Pomelli C, Ochterski JW, Ayala PY, Morokuma K, Voth GA, Salvador P, Dannenberg JJ, Zakrzewski VG, Dapprich S, Daniels AD, Strain MC, Farkas O, Malick DK, Rabuck AD, Raghavachari K, Foresman JB, Ortiz JV, Cui Q, Baboul AG, Clifford S, Cioslowski J, Stefanov BB, Liu G, Liashenko A, Piskorz P, Komaromi I, Martin RL, Fox DJ, 
Keith TA, Al-Laham MA, Peng CY, Nanayakkara A, Challacombe M, Gill PMW, Johnson B, Chen W, Wong MW, Gonzalez C, Pople JA (2009) Gaussian 09, revision A; Gaussian Inc.: Wallingford, CT.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBecke AD (1993) J Chem Phys 98:5648\u0026ndash;5652.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLee C, Yang W, Parr RG (1988) Phys Rev B 37:785\u0026ndash;789.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHariharan PC, Pople JA (1974) Mol Phys 27:209\u0026ndash;214.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhao Y, Schultz NE, Truhlar DG (2006) J Chem Theory Comput 2:364\u0026ndash;382.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHehre WJ, Radom L, Schleyer PvR, Pople JA (1986) Ab Initio Molecular Orbital Theory. Wiley, New York.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKeith TA (2012) AIMAll, Version 12.05.09, Gristmill Software, Overland Park, KS.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVolkov A, Koritsanszky TS, Chodkiewicz M, King HF (2009) J Comput Chem 30:1379\u0026ndash;1391.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSDBSWeb: http://sdbs.db.aist.go.jp (National Institute of Advanced Industrial Science and Technology, Feb 02, 2016).\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKononenko I, Kukar M (2007) \u003cem\u003eMachine learning and data mining: Introduction to principles and algorithms\u003c/em\u003e, Horwood publishing.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHandley CM, Popelier PLA (2009) J Chem Theory Comput 5:1474\u0026ndash;1489.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRumelhart DE, Hinton GE, Williams RJ (1986) Nature 323:533\u0026ndash;536.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWidrow B, Lehr MA (1990) Proc. 
IEEE 78:1415\u0026ndash;1442.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTerfloth L, Gasteiger J (2001) Drug Discovery Today 6:102\u0026ndash;108.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) \u003cem\u003eThe WEKA Data Mining Software: An Update\u003c/em\u003e, SIGKDD Explorations 11:1.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKruse H, Goerigk L, Grimme S (2012) J Org Chem 77:10824\u0026ndash;10834.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBader RFW, MacDougall PJ (1985) J Amer Chem Soc 107:6788\u0026ndash;6795.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaradwaj PR, Varadwaj A, Marques HM, MacDougall PJ (2019) Phys Chem Chem Phys 21:19969\u0026ndash;19986.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKoritsanszky TS, Volkov A, Chodkiewicz M (2010) Structure and Bonding 147:1\u0026ndash;25.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eThorell S, Schurmann M, Sprenger GA, Schneider G (2002) J Mol Biol 319:161\u0026ndash;171. 
Protein Data Bank entry code 1L6W.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLuchi AM, Villafa\u0026ntilde;e RN, G\u0026oacute;mez-Ch\u0026aacute;vez JL, Bogado ML, Angelina EL, Peruchena NM (2019) ACS Omega 4:19582\u0026ndash;19594.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMassa L, Matta CF (2017) J Comput Chem 39:1021\u0026ndash;1028.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCademartiri L, Ozin GA (2009) Concepts of Nanochemistry, Wiley VCH, Germany.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"theoretical-chemistry-accounts","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"tcac","sideBox":"Learn more about [Theoretical Chemistry Accounts](http://link.springer.com/journal/214)","snPcode":"214","submissionUrl":"https://submission.nature.com/new-submission/214/3","title":"Theoretical Chemistry Accounts","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"QTAIM, charge density topology, picotechnology, quantum atomics, machine learning","lastPublishedDoi":"10.21203/rs.3.rs-4669576/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-4669576/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eWe explore the use of machine learning to predict spectroscopic properties and interaction energies of the carbonyl groups in 225 ketones, aldehydes, imides, and amides. In the combined spirit of Density Functional Theory (DFT) and the Quantum Theory of Atoms in Molecules (QTAIM), but with an eye toward eventually using databases of transferable fragment densities, we limit the training data to small sets of descriptors (from 18 to 48 per molecule) that are based on topological features in the total charge density, ρ, and/or its Laplacian, ∇2ρ. We obtain a mean absolute error under 1% for carbonyl stretching frequencies, and just over 1% for C-13 NMR shifts. Predicting interaction energies with a model nucleophile (fluoride ion) is significantly more challenging. Mean absolute errors just over 3 kcal/mol were obtained for covalent bond formation energies. Similar mean absolute errors were obtained for much weaker van der Waals interaction energies. 
We also conducted a stress-test to see if our small molecule-based machine learning could predict covalent bond formation energy in a model of the active site of the \u003cem\u003eE. coli\u003c/em\u003e enzyme, D-fructose-6-phosphate aldolase.\u003c/p\u003e","manuscriptTitle":"Using Quantum Atomics and Machine Learning to Advance Picotechnology","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-08-01 14:33:11","doi":"10.21203/rs.3.rs-4669576/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2024-07-31T12:57:06+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-07-31T08:00:23+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"300395350369605274775360439051136475077","date":"2024-07-31T06:54:51+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2024-07-23T10:37:39+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"165382470205898036017004447658044243074","date":"2024-07-15T12:49:36+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"44335242772461882736909308150886580857","date":"2024-07-11T07:30:29+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"290572251122311169605661415846560413844","date":"2024-07-09T13:14:12+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2024-07-09T09:05:29+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2024-07-09T07:18:28+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2024-07-09T01:13:14+00:00","index":"","fulltext":""},{"type":"submitted","content":"Theoretical Chemistry Accounts","date":"2024-07-01T17:09:17+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"theoretical-chemistry-accounts","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"tcac","sideBox":"Learn more about [Theoretical Chemistry Accounts](http://link.springer.com/journal/214)","snPcode":"214","submissionUrl":"https://submission.nature.com/new-submission/214/3","title":"Theoretical Chemistry Accounts","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"79912be8-0317-4536-b3e7-8a9c0921a25b","owner":[],"postedDate":"August 1st, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-09-30T16:07:09+00:00","versionOfRecord":{"articleIdentity":"rs-4669576","link":"https://doi.org/10.1007/s00214-024-03142-9","journal":{"identity":"theoretical-chemistry-accounts","isVorOnly":false,"title":"Theoretical Chemistry Accounts"},"publishedOn":"2024-09-26 15:58:05","publishedOnDateReadable":"September 26th, 2024"},"versionCreatedAt":"2024-08-01 14:33:11","video":"","vorDoi":"10.1007/s00214-024-03142-9","vorDoiUrl":"https://doi.org/10.1007/s00214-024-03142-9","workflowStages":[]},"version":"v1","identity":"rs-4669576","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-4669576","identity":"rs-4669576","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
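As a small worked check of the stress-test numbers, the sketch below encodes Eqn. 2 and recomputes the absolute error of each ANN prediction against the DFT value reported in Table 4. The function and variable names are illustrative assumptions and are not taken from the authors' code; only the four Table 4 energies come from the paper.

```python
def interaction_energy(e_pocket_cc_f, e_pocket_cc, e_pocket_f, e_pocket):
    """Eqn. 2: dE = E(pocket+CC+F) - E(pocket+CC) - E(pocket+F) + E(pocket).

    All four arguments must be in the same energy unit; the result is in that unit.
    """
    return e_pocket_cc_f - e_pocket_cc - e_pocket_f + e_pocket


# Covalent interaction energies from Table 4 (kcal/mol).
DFT_REFERENCE = -22.7  # B3LYP/6-31+G*
ANN_PREDICTIONS = {"BCP": -11.9, "LCP": -19.7, "Combined": -14.9}


def absolute_errors(reference, predictions):
    """Absolute deviation of each prediction from the reference, rounded to 0.1."""
    return {name: round(abs(value - reference), 1)
            for name, value in predictions.items()}


errors = absolute_errors(DFT_REFERENCE, ANN_PREDICTIONS)
best_descriptor_set = min(errors, key=errors.get)

for name, err in errors.items():
    print(f"{name:>8}: |error| = {err:.1f} kcal/mol")
print("Smallest error:", best_descriptor_set)
```

The LCP entry reproduces the 3.0 kcal/mol absolute error quoted in the text, while the BCP and combined descriptor sets deviate from the DFT value by 10.8 and 7.8 kcal/mol, respectively.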