Accuracy of AlphaFold models: Comparison with short N ... O contacts in atomic resolution protein crystal structures

Oliviero Carugo

Research Article. This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-3821040/v1
This work is licensed under a CC BY 4.0 License.
Status: Published. Journal publication published 01 Apr, 2024 in Computational Biology and Chemistry (https://doi.org/10.1016/j.compbiolchem.2024.108069). Version 1 posted; this is the latest preprint version.

Abstract

Artificial intelligence (AI) has revolutionized structural biology by predicting protein 3D structures with near-experimental accuracy. Here, short backbone N-O distances in high-resolution crystal structures were compared to those in three-dimensional models built with the AI-based methods AlphaFold/ColabFold, explicitly taking into account their estimated standard errors.
Experimental and computationally modeled distances very often differ significantly, showing that the precision of these models is inadequate to reproduce experimental results at high resolution. T-tests and normal probability plots showed that the estimated standard errors of the atomic positions in these computational models are 3.5–6 times larger than the experimental errors.

Keywords: Accuracy, Artificial intelligence, Estimated standard error, Protein Data Bank, Protein structure prediction

Introduction

Artificial intelligence (AI) has transformed structural biology since the development of methods that can predict protein 3D structures with unprecedented and near-experimental accuracy [1]. AlphaFold2 and RoseTTAFold in 2021 [2][3], ColabFold in 2022 [4], and ESM-2 in 2023 [5] demonstrated that machine learning methods can predict a protein 3D structure in a few minutes and at a fraction of the cost of any experimental method. Accordingly, the Protein Data Bank (PDB) [6][7][8], which is the traditional repository of all the known (ca. 200,000) protein 3D structures, recently included computational models (ca. 1,000,000). The European Molecular Biology Laboratory and DeepMind, the company that created AlphaFold2, coproduced the freely available AlphaFold Protein Structure Database [9], which contains ca. 200 million models. More recently the ESM-atlas, with more than 700 million predicted models, became available (https://esmatlas.com).

Even the structures of protein complexes have been predicted with AlphaFold-Multimer [10], though with lower accuracy than single chain structures, especially for heteromers [11] and for antibody-antigen complexes [12].
Apart from justifiable excitement, the scientific community expressed some reservations, partly because protein modeling (homology modeling or fold recognition) had been possible for several decades [13], albeit with a lower success rate than AI methods, and partly because the performance of these revolutionary new tools needed to be carefully and thoroughly validated. General surveys of AlphaFold2's performance appeared, showing unexpected features such as the possibility of accurately predicting intrinsically disordered regions [11][14]. It has been suggested that AlphaFold2 models are often suitable for computational docking and target-based virtual screening [14]. According to another study, loop structures can be predicted quite reliably by AlphaFold2, though the reliability decreases as the loop length increases, and long loops (more than 20 residues) should be carefully evaluated [15]. On the other hand, it was shown that the structure of mutated proteins does not appear to be predictable with AlphaFold2, at least for the time being [16]. It was also observed that only about half of the chalcogen bonds observed in high resolution protein crystal structures are detected in the corresponding models available at the AlphaFold Protein Structure Database [17]. More serious objections to AI models were also advanced, owing to the fact that these new technologies are based on statistics rather than on chemical and physical principles [18]. Clearly, just as we can expect unanticipated surprises from future developments in AI, we must also expect constant monitoring of the performance of these new techniques, which, while promising, are based on very weak scientific foundations, at least in terms of the traditional Science we have known and used since Galileo Galilei.

In the present communication, the short main-chain nitrogen-oxygen contacts observed in high resolution protein crystal structures are compared with those in AlphaFold2 models.
To this end, we use the tools developed by Cruickshank, Blow, Helliwell and Sekar to determine the estimated standard errors of the atomic positions in protein crystal structures [19][20][21][22]. This makes it possible to give a statistical interpretation to the comparisons between the experimental and the computational N-O contacts. For this analysis to be effective, crystal structures of extremely high resolution are necessary. At lower resolutions, the potential margin of error in atomic positions could overshadow any discernible differences between experimental and computational N-O contacts. In other words, if the experimental data are not accurate enough, differences between the experimental structures and the computational models cannot be detected.

We noted considerable differences between experimental and computational N-O contacts. Even when focusing on atom contacts predicted with the utmost confidence, approximately 40% of these contacts still show significant discrepancies. We also assessed the precision of atomic positions in the computational models using t-tests and normal probability plots. We found that it is substantially lower: the positional errors are approximately five times larger than those of the experimental data. Currently, high-resolution crystal structures offer more precise and dependable insights into protein stereochemistry. They should be prioritized over AI-generated computational models in scenarios demanding utmost accuracy, such as drug design or the analysis of reaction mechanisms.

Methods

Data selection

All protein structures were downloaded from the Protein Data Bank [6][7][8], according to the following procedure. Only crystal structures determined in the 90–110 K temperature range and refined at resolution better than 1 Å were retained.
Structures containing nucleic acids or too many non-aqueous heteroatoms (more than 5% of the protein atoms) were discarded, since these structures might be influenced by factors other than the protein amino acid sequence. Multi-model refinements were also excluded, as were structures with too large an average B-factor (> 25 Å²) according to reference [23]. To ensure data quality, structures with missing residues were excluded, too [24]. All the retained structures are monomeric, with a single chain present in the asymmetric unit.

Estimated standard errors of atomic coordinates in crystal structures

The estimated standard error associated with the coordinates of the i-th atom ( \({ese}_{coordinate,i}\) ) was computed as

$${ese}_{coordinate,i}=DPI\cdot \sqrt{\frac{{B}_{i}}{{B}_{ave}}}$$ (1)

according to reference [21], where \({B}_{i}\) is the B-factor of the i-th atom, \({B}_{ave}\) is the average B-factor of the protein, and DPI is the Diffraction Precision Index, which estimates the average standard error of the positions of the protein atoms and was computed according to reference [19] as

$$DPI={\left(\frac{N}{p}\right)}^{1/2}\cdot {C}^{-1/3}\cdot R\cdot res$$ (2)

where N is the number of atoms that were refined, p is the difference between the number of observations and the number of refined parameters, C is the fractional completeness of the data, R is the R-factor, and res is the crystallographic resolution. Both DPI and coordinate estimated standard errors were computed with the server described in reference [22]. It must be observed that in several instances it was not possible to get reliable DPI values, since the percentage of fully occupied atoms is less than 90%. The attention was then focused on the following PDB entries: 1b0y, 1oew, 2pne, 3dha, 3wcq, 6cnw, 7vdn, 1ixh, 1xmk, 3agn, 3e4g, 4a02, 6fo5, 1mc2, 2fma, 3akq, 3hgp, 5nfm, 6zm8.
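As an illustration only, Eqs. (1) and (2) can be sketched in a few lines of Python. This is a minimal sketch and not the server of reference [22] that was actually used: the function names and the numerical inputs are hypothetical and were chosen merely to make the arithmetic easy to follow.

```python
import math


def diffraction_precision_index(n_atoms, n_observations, n_parameters,
                                completeness, r_factor, resolution):
    """Eq. (2): DPI = (N/p)^(1/2) * C^(-1/3) * R * res.

    p is the number of observations minus the number of refined parameters.
    """
    p = n_observations - n_parameters
    if p <= 0:
        raise ValueError("observations must outnumber refined parameters")
    return (math.sqrt(n_atoms / p)
            * completeness ** (-1.0 / 3.0)
            * r_factor
            * resolution)


def ese_coordinate(dpi, b_atom, b_average):
    """Eq. (1): per-atom coordinate error, scaled by the relative B-factor."""
    return dpi * math.sqrt(b_atom / b_average)


if __name__ == "__main__":
    # Hypothetical refinement statistics (not taken from any PDB entry).
    dpi = diffraction_precision_index(
        n_atoms=1000, n_observations=5000, n_parameters=1000,
        completeness=1.0, r_factor=0.10, resolution=1.0)
    print(f"DPI = {dpi:.4f} A")
    # An atom whose B-factor is four times the average has twice the DPI.
    print(f"ese = {ese_coordinate(dpi, b_atom=8.0, b_average=2.0):.4f} A")
```

The sketch makes the scaling explicit: an atom with a B-factor above the protein average receives a proportionally larger coordinate error, which is why more mobile atoms are compared less stringently in the tests that follow.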
Estimated standard errors of interatomic distances in crystal structures

The estimated standard error associated with the interatomic distance between the i-th and the j-th atoms was computed as

$${ese}_{contact}=\sqrt{{ese}_{coordinate,i}^{2}+{ese}_{coordinate,j}^{2}}$$ (3)

according to reference [25].

Computational modelling

All computational models were built with ColabFold [4] by using AlphaFold2 [2] for structure prediction with three prediction cycles. ColabFold is much faster than AlphaFold2 because it uses a faster homology search technique (MMseqs2) [26], and is otherwise analogous to AlphaFold2. The use of templates from the Protein Data Bank was allowed and Amber energy minimization was applied. The top-ranked model was retained for the comparison with the experimental structure.

Comparison between experimental and computed N-O contacts

The experimental N-O distances ( \({d}_{exp}\) ) observed in the crystal structures and their counterparts ( \({d}_{mod}\) ) observed in the computational models can be compared with a t-test

$$t=\frac{\left|{d}_{exp}-{d}_{mod}\right|}{{ese}_{contact,exp}}$$ (4)

where \({ese}_{contact,exp}\) is the estimated standard error of \({d}_{exp}\) computed with Eq. (3). Experimental and computed distances were considered significantly different if t > 2.576 (0.99 probability level [27]). Only contacts with \({d}_{exp}\) < 4 Å were considered, though very similar results were obtained by lowering this threshold to 3.5 Å or by increasing it to 4.5 Å. In case of conformational disorder, when an atom has two or more equilibrium positions, only the first was considered and the others were discarded.
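The test defined by Eqs. (3) and (4) can be sketched as follows. This is a hedged illustration and not the code used in this work: the function names are hypothetical and the distances and errors in the usage lines are invented for the example.

```python
import math

# Two-sided critical value at the 0.99 probability level, as used in the text.
T_CRITICAL_99 = 2.576


def ese_contact(ese_i, ese_j):
    """Eq. (3): error of a distance from the errors of its two atoms."""
    return math.hypot(ese_i, ese_j)


def t_value(d_exp, d_mod, ese_contact_exp):
    """Eq. (4): distance difference in units of the experimental error."""
    return abs(d_exp - d_mod) / ese_contact_exp


def significantly_different(d_exp, d_mod, ese_n, ese_o,
                            t_critical=T_CRITICAL_99):
    """True when the modelled N-O distance differs from the experimental one."""
    return t_value(d_exp, d_mod, ese_contact(ese_n, ese_o)) > t_critical


if __name__ == "__main__":
    # Hypothetical contact: both atoms have a coordinate error of 0.015 A.
    print(significantly_different(2.90, 2.98, 0.015, 0.015))
    print(significantly_different(2.90, 2.93, 0.015, 0.015))
```

Note that only the experimental error enters Eq. (4); the model distance is treated as error-free at this stage, which is what makes the test so stringent for atomic resolution structures.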
Estimated standard errors of atomic coordinates in computational models

The experimental N-O distance ( \({d}_{exp}\) ) observed in the crystal structure and its counterpart ( \({d}_{mod}\) ) observed in the computational model are statistically identical (0.99 probability level; [28]) if

$$\frac{\left|{d}_{exp}-{d}_{mod}\right|}{\sqrt{{ese}_{contact,exp}^{2}+{ese}_{contact,mod}^{2}}}<2.576$$ (5)

where \({ese}_{contact,exp}\) and \({ese}_{contact,mod}\) are the estimated standard errors associated with \({d}_{exp}\) and \({d}_{mod}\). Consequently, it is possible to compute the minimal \({ese}_{contact,mod}\) value that makes \({d}_{exp}\) and \({d}_{mod}\) statistically equal (with 6.636 = 2.576²)

$$\sqrt{\frac{{\left({d}_{exp}-{d}_{mod}\right)}^{2}-6.636\cdot {ese}_{contact,exp}^{2}}{6.636}}<{ese}_{contact,mod}$$ (6)

which can be evaluated only if

$${\left({d}_{exp}-{d}_{mod}\right)}^{2}-6.636\cdot {ese}_{contact,exp}^{2}>0$$ (7)

i.e. when \({ese}_{contact,exp}\) is not large enough to make \({d}_{exp}\) and \({d}_{mod}\) statistically equal independently of \({ese}_{contact,mod}\). Then, by assuming that nitrogen and oxygen have the same coordinate estimated standard error in the computational model, it is possible to estimate it ( \({ese}_{coordinate,mod}\) )

$${ese}_{coordinate,mod}=\frac{{ese}_{contact,mod}}{\sqrt{2}}$$ (8)

Estimated standard errors through normal probability plots

A different approach for estimating the accuracy of the N-O distances in the computational models is provided by normal probability plots (Npp) [29][30]. Npps have been designed to compare two sets of experimental data (X and Y), each characterized by n variables. The weighted difference \({d}_{i}\) between the i-th variables \({x}_{i}\) and \({y}_{i}\) (1 ≤ i ≤ n) is calculated as

$${d}_{i}=\frac{\left({x}_{i}-{y}_{i}\right)}{\sqrt{{sx}_{i}^{2}+{sy}_{i}^{2}}}$$ (9)

where \({sx}_{i}\) and \({sy}_{i}\) are the standard errors of \({x}_{i}\) and \({y}_{i}\), respectively. The n observed \({d}_{i}\) values, sorted in order of increasing amplitude, are then plotted against the expected \({de}_{i}\) values, which can be computed as follows

$${de}_{i}=\left|\frac{n-2i+1}{n}\right|$$ (10)

with negative sign when i < n/2 and positive sign when i > n/2. If a regression line with unit slope and zero intercept fits the n data points, then X = Y.
Otherwise, it can be assumed that either X ≠ Y or that the standard errors \({sx}_{i}\) and \({sy}_{i}\) are underestimated. Under the second assumption, the standard errors of the data can be estimated using Npps: the observed differences \({d}_{i}\) are plotted against their expected values \({de}_{i}\) and the slope a of the regression line (d = a·de) is used to calculate the average standard error \(\sigma\) as

$$\sigma =a/\sqrt{2}$$ (11)

Here X and Y are the N-O distances in the crystal structures and in the computational models, and \(\sigma\) is the average positional estimated standard error, shared by the crystal structures and the computational models. If the average positional estimated standard error of the crystal structures is known, it can be subtracted, and it is thus possible to estimate the average positional estimated standard error in the computational models as

$${ese}_{coordinate,mod}=\sqrt{{\sigma }^{2}-{ese}_{coordinate,exp}^{2}}$$ (12)

Miscellaneous

Secondary structures were assigned with Stride [31] and solvent accessibility of atoms and residues was measured with Naccess [32]. Secondary structures were reduced to three states: helices (α, π, and 3₁₀ helices, associated with the labels H, I, and G in Stride), extended (β-strands and bridges, associated with the labels E, B, and b), and loops (all the rest).

Results and Discussion

Models' quality and accuracy

All models were built with ColabFold [4] as described in the Methods. This software, like AlphaFold2 [2], computes for each residue a variable called pLDDT that represents the expected trustworthiness of the prediction.
pLDDT values range from 0 to 100: if pLDDT > 90, the positions of both main-chain and side-chain atoms are considered to be reliably predicted; if 70 < pLDDT < 90, errors are mostly possible in side-chain atoms; if 50 < pLDDT < 70, errors are also possible in main-chain atoms; and if pLDDT < 50, predictions are essentially unreliable, in other words, prediction was impossible.

Highly reliable models were obtained, with only one exception. The average pLDDT value is very close to or larger than 90 for nearly all models (Fig. 1a) and the distribution of the pLDDT values shows that they are very seldom smaller than 90, despite a small peak close to 40 (Fig. 1b). Only one of the models is unreliable. Its average pLDDT is very small (ca. 39), well below the threshold value (50) under which predictions are considered to be impossible. Not surprisingly, in this case the difference between the computational model and the crystal structure is enormous, too (rmsd = 18 Å). This is an antifreeze protein from Hypogastrura harveyi (2pne) that folds into a compact structure similar to a cuboid parallelepiped, the two largest and opposite faces of which are formed by three antiparallel left-handed polyproline II helices. There are many hydrogen bonds (342), more than 4 per residue. We are not aware of any research on the capacity of AlphaFold2 or ColabFold to predict polyproline helices; nevertheless, this could be a shortcoming of both AI-based computational approaches, and the incorrect AI-based modelling of an antifreeze protein fold has already been observed [33].

Comparison between N-O contacts

The contacts between a nitrogen and an oxygen atom of the main-chain shorter than 4 Å were considered. As stated in the Methods, the findings were remarkably similar whether this threshold was raised or lowered by 0.5 Å. According to Eq. (4), t-values were calculated for each pair of experimental and computational contacts, and if t > 2.576 (0.99 probability threshold), the experimental and computational distances were deemed to be different. Each nitrogen-oxygen contact was assigned the average pLDDT of its two atoms. The average t-values estimated at various pLDDT levels are much bigger than 2.576, as shown in Table 1, and, as expected, they increase as pLDDT decreases. Furthermore, a considerable percentage of the experimental N-O contacts differ statistically from their computationally modelled counterparts and, as expected, the percentage of cases where the difference is statistically significant rises as pLDDT decreases.

Secondary structure and solvent accessibility were also used to classify N-O contacts. The fraction of experimental N-O contacts that differ significantly from their computationally modelled equivalents varies only marginally with secondary structure and is mostly independent of the relative solvent exposed surface area of the residues. In contrast, this fraction falls as the solvent accessible surface area of the atoms themselves grows (Table 2). This implies that the experimental and modelled distances tend to become statistically equal for atoms that are more exposed to the solvent, which may simply be due to the fact that these atoms have greater B-factors and, as a result, larger positional standard errors (Eq. 1).

The crystal structures considered in the present study are the most accurate protein crystal structures determined to date. In addition to the extraordinarily high resolution, which is comparable to that achievable in small asymmetric unit crystallography, the absence of structurally undetermined protein sections makes these crystal structures the crème de la crème of the protein structural data currently available.
Although, as illustrated above, the computational models are certainly of good quality, they appear to be less accurate than these top-quality crystal structures. If a chemical understanding of the structure is required, with an accurate description of the stereochemistry and of the energy associated with it, achieving experimental high resolution is still preferable to employing computer models generated with artificial intelligence.

Table 1. Analyses of t-values computed with Eq. (4). Number of observations (obs), average value of t with standard deviation in parentheses (ave and std, respectively), and percentage of times t > 2.576 (pc).

pLDDT range         obs     ave (std)       pc
any                 2415    12.1 (1.0)      40.2
pLDDT > 90          2224    4.9 (0.3)       39.4
70 < pLDDT ≤ 90     55      23.1 (7.8)      39.6
50 < pLDDT ≤ 70     11      44.2 (17.2)     57.9
pLDDT ≤ 50          125     198.6 (22.8)    60.4

Table 2. Percentages (pc) of N-O contacts where the experimental and the modelled distances are significantly different at increasing values of the average solvent accessible surface area (SASA, Å²) of the nitrogen and oxygen atoms.

SASA (Å²)    pc
0–1          43.7
1–2          35.7
2–3          32.3
3–4          33.5
4–5          33.1
5–6          29.5
6–7          32.2
7–8          34.7
8–9          26.5

Estimated standard errors of the atomic positions in computational models

Equations (5–8) allow the determination of the estimated standard error of the atomic position in computational models ( \({ese}_{coordinate,mod}\) ). This is the smallest coordinate error required to make the experimental and computed N-O distances statistically equal. This analysis was performed only on two categories of N-O contacts, those with average pLDDT > 90 and those with 70 < pLDDT ≤ 90. Contacts with lower pLDDT values were excluded for three reasons: first, there are few observations of N-O contacts with pLDDT ≤ 70; second, given that there is little confidence in the prediction, these cases are of little interest; third, nearly all the observations with pLDDT ≤ 50 are concentrated in a single structure that was impossible to model (2pne), as described above.
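The smallest model error defined by Eqs. (5–8) can be sketched as below. This is only an illustrative sketch under stated assumptions, not the code used for the results reported here: the function names are hypothetical, the input numbers are invented, and the square of 2.576 is used directly in place of the rounded value 6.636.

```python
import math

# Two-sided critical value at the 0.99 probability level (Eq. 5).
T_CRITICAL_99 = 2.576


def min_ese_contact_mod(d_exp, d_mod, ese_contact_exp,
                        t_critical=T_CRITICAL_99):
    """Lower bound of Eq. (6) on the model distance error.

    Returns None when the condition of Eq. (7) is not met, i.e. when the
    experimental error alone already makes the two distances equal.
    """
    t_squared = t_critical ** 2
    excess = (d_exp - d_mod) ** 2 - t_squared * ese_contact_exp ** 2
    if excess <= 0:
        return None
    return math.sqrt(excess / t_squared)


def min_ese_coordinate_mod(d_exp, d_mod, ese_contact_exp,
                           t_critical=T_CRITICAL_99):
    """Eq. (8): same error assumed for N and O, so divide by sqrt(2)."""
    contact = min_ese_contact_mod(d_exp, d_mod, ese_contact_exp, t_critical)
    return None if contact is None else contact / math.sqrt(2)


if __name__ == "__main__":
    # Hypothetical contact: 0.10 A discrepancy, 0.02 A experimental error.
    print(min_ese_coordinate_mod(2.90, 3.00, 0.02))
    # Small discrepancy: Eq. (7) fails and no model error is required.
    print(min_ese_coordinate_mod(2.90, 2.91, 0.02))
```

Returning None for contacts that fail Eq. (7) keeps them out of any later average, which mirrors the fact that such contacts carry no information about the model error.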
The \({ese}_{coordinate,mod}\) values were computed in this way for each N-O contact, by assuming that the nitrogen and the oxygen atom have the same \({ese}_{coordinate,mod}\), and then averaged. As expected, their values, shown in Fig. 2 (upper part), are smaller for higher levels of pLDDT and larger as pLDDT decreases. In any case, they are much larger than the estimated standard errors of the coordinates in the crystal structures, which have an average value of 0.0149 (± 0.0001) Å. This is also evident in the distributions of the estimated standard errors for the experimental and modelled structures (Fig. 2, lower part): the experimental ones are almost always smaller than 0.03 Å, while those of the models can be much bigger.

Similarly, it is possible to compute the \({ese}_{coordinate,mod}\) through normal probability plots (Npp) as described in the Methods. In this case, all N-O distances observed in the crystal structures are compared to their counterparts in the computational models to get an average estimated standard error, the value of which is reported in Fig. 2 (upper part). For the N-O contacts that have an average pLDDT larger than 90, the Npp estimated standard error is slightly larger than that computed with equations (5–8). In contrast, it is slightly smaller for the N-O contacts in the 70–90 pLDDT range. Both these values remain, however, much larger than the estimated standard errors of the crystal structures.

Overall, the estimated standard errors of the atomic positions in the models built with ColabFold are 3.5–6 times larger than the estimated standard errors of the atomic positions in the crystal structures examined in the present study.

Conclusions

Short backbone N-O distances found in high-resolution crystal structures were compared with those found in three-dimensional models created with the artificial intelligence (AI) based techniques AlphaFold/ColabFold, by explicitly taking into account their estimated standard errors.
It has been observed that experimental and computationally modeled distances often differ significantly, indicating that the accuracy of these models is insufficient to replicate experimental findings. It was possible to determine, by using t-tests and normal probability plots, that the estimated standard errors of the atomic positions predicted by these computational techniques are 3.5–6 times greater than the experimental errors.

Declarations

Ethics approval and consent to participate: Not applicable.
Consent for publication: Not applicable.
Availability of data and materials: All data are taken from the Protein Data Bank; further details are available on request.
Competing interests: No competing interests.
Funding: Not applicable.
Authors' contributions: OC, who is the only Author, designed and performed all the analyses and wrote the manuscript.
Acknowledgments: A. Corelli is gratefully acknowledged for constant support and K. Djinović for helpful discussion. Dr. K. Sekar (Indian Institute of Science, Bangalore) is gratefully acknowledged for his help in computing positional standard errors in protein crystal structures. The author acknowledges support from the Ministero dell'Università e della Ricerca (MUR) and the University of Pavia through the program "Dipartimenti di Eccellenza 2023–2027".

References

1. Carugo O, Djinović-Carugo K. Structural biology: A golden era. PLoS Biol. 2023;21:e3002187.
2. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:584–9.
3. Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871–6.
4. Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19:679–82.
5. Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30.
6. Bernstein FC, Koetzle TF, Williams GJB, Meyer EFJ, Brice MD, Rodgers JR, et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 1977;112:535–42.
7. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–42. http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Citation&list_uids=10592235.
8. wwPDB Consortium. Protein Data Bank: The single global archive for 3D macromolecular structural data. Nucleic Acids Res. 2019;47:D520–8.
9. Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439–44.
10. Evans R, O'Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2022;2021.10.04.463034. doi:10.1101/2021.10.04.463034.
11. Zhu W, Shenoy A, Kundrotas P, Elofsson A. Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics. 2023;39.
12. Yin R, Feng BY, Varshney A, Pierce BG. Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci. 2022;31:e4379.
13. Tramontano A. Protein Structure Prediction: Concepts and Applications. New York: John Wiley & Sons; 2006.
14. Binder JL, Berendzen J, Stevens AO, He Y, Wang J, Dokholyan NV, et al. AlphaFold illuminates half of the dark human proteins. Curr Opin Struct Biol. 2022;74:102372.
15. Stevens AO, He Y. Benchmarking the Accuracy of AlphaFold 2 in Loop Structure Prediction. Biomolecules. 2022;12:985.
16. Buel GR, Walters KJ. Can AlphaFold2 predict the impact of missense mutations on structure? Nat Struct Mol Biol. 2022;29:1–2.
17. Carugo O, Djinovic-Carugo K. Automated identification of chalcogen bonds in AlphaFold protein structure database files: is it possible? Front Mol Biosci. 2023;10:1155629.
18. Moore PB, Hendrickson WA, Henderson R, Brunger AT. The protein-folding problem: Not yet solved. Science. 2022;375:507.
19. Cruickshank DWJ. Remarks about protein structure precision. Acta Cryst. 1999;D55:583–93.
20. Blow DM. Rearrangement of Cruickshank's formulae for the diffraction-component precision index. Acta Cryst. 2002;D58:792–7.
21. Gurusaran M, Shankar M, Nagarajan R, Helliwell JR, Sekar K. Do we see what we should see? Describing non-covalent interactions in protein structures including precision. IUCrJ. 2014;1:74–81.
22. Dinesh Kumar KS, Gurusaran M, Satheesh SN, Radha P, Pavithra S, Thulaa Tharshan KPS, et al. Online_DPI: a web server to calculate the diffraction precision index for a protein structure. J Appl Cryst. 2015;48:939–42.
23. Carugo O. How large B-factors can be in protein crystal structures. BMC Bioinformatics. 2018;19:61. doi:10.1186/s12859-018-2083-8.
24. Djinovic Carugo K, Carugo O. Missing strings of residues in protein crystal structures. Intrinsically Disord Proteins. 2015;3:1–7.
25. Giacovazzo C, Monaco HL, Artioli G, Viterbo D, Ferraris G, Gilli G, et al. Fundamentals of Crystallography. Oxford: Oxford University Press; 2002.
26. Mirdita M, Steinegger M, Söding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35:2856–8.
27. Dowdy S, Wearden S, Chilko D. Statistics for research. Hoboken: John Wiley & Sons; 2004.
28. Cruickshank DWJ, Robertson AP. The comparison of theoretical and experimental determinations of molecular structures, with applications to naphthalene and anthracene. Acta Cryst. 1953;6:698–705.
29. Abrahams SC, Keve ET. Normal probability plot analysis of error in measured and derived quantities and standard deviations. Acta Cryst. 1971;A27:157–61.
30. Hamilton WC, Abrahams SC. Normal probability plot analysis of small samples. Acta Cryst. 1972;A28:215–8.
31. Heinig M, Frishman D. STRIDE: A web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 2004;32:W500–2.
32. Hubbard SJ, Thornton JM. NACCESS. Department of Biochemistry and Molecular Biology, University College London. 1993.
33. Laurents DV. AlphaFold 2 and NMR Spectroscopy: Partners to Understand Protein Structure, Dynamics and Function. Front Mol Biosci. 2022;9:906437.
Figures

Figure 1. Quality and reliability of the computational models. (a) Average values (with standard deviations in parentheses) of the pLDDT values of the computational models and rmsd values (Å) after superposition of the Cα atoms of the computational models on the corresponding crystal structures. (b) Distribution of the pLDDT values in all the computational models. (c) Superposition of the Cα atoms of the computational model on the crystal structure for the only case in which it was impossible to build a reliable computational model, which is entirely different from the crystal structure (the model is depicted in black and the crystal structure in gray).

Figure 2. Estimated standard errors (Å) of the atomic positions in the computational models built with ColabFold. (up) Average values computed with equations (5–8), with standard deviations in parentheses, and values computed with the normal probability plots (Npp). (down) Distributions of the positional estimated standard errors of the crystal structures (continuous line) and of the computational models (broken line).
these revolutionary new tools needed to be carefully and thoroughly validated.\u003c/p\u003e \u003cp\u003eGeneral surveys of AlphaFold2\u0026rsquo;s performance appeared, showing unexpected features such as the possibility of accurately predicting intrinsically disordered regions [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e][\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. It has been suggested that AlphaFold2 models are often suitable for computational docking and target-based virtual screening [\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e]. According to another study, loop structures can be predicted quite reliably by AlphaFold2, though the reliability decreases as the loop length increases and long loops (more than 20 residues) should be carefully evaluated [\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eOn the other hand, it was shown that the structure of mutated proteins does not appear to be predictable with AlphaFold2, at least for the time being [\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e]. 
It was also observed that only half of the chalcogen bonds observed in high resolution protein crystal structures are detected in the corresponding models available at the AlphaFold Protein Structure Database [\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eMore serious objections to AI models were also advanced, owing to the fact that these new technologies are based on statistics rather than on chemical and physical principles [\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eClearly, just as we can expect surprises from future developments in AI, we must also expect constant monitoring of the performance of these new techniques, which, while promising, are based on very weak scientific foundations, at least in terms of the traditional Science we have known and used since Galileo Galilei.\u003c/p\u003e \u003cp\u003eIn the present communication, the short main-chain nitrogen-oxygen contacts observed in high resolution protein crystal structures are compared with those in AlphaFold2 models.\u003c/p\u003e \u003cp\u003eFor this purpose, we use the tools developed by Cruickshank, Blow, Helliwell and Sekar to determine the estimated standard errors of the atomic positions in protein crystal structures [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e][\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e][\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e][\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e]. In this way it is possible to give a statistical interpretation to the comparisons between the experimental and the computational N-O contacts.\u003c/p\u003e \u003cp\u003eFor this analysis to be effective, crystal structures of extremely high resolution are necessary. 
At lower resolutions, the potential margin of error in atomic positions could overshadow any differences between experimental and computational N-O contacts. In other words, if the experimental data are not precise enough, differences between the experimental structures and the computational models cannot be detected.\u003c/p\u003e \u003cp\u003eWe noted considerable differences between experimental and computational N-O contacts. Even when focusing on atom contacts predicted with the utmost confidence, approximately 40% of these contacts still show significant discrepancies.\u003c/p\u003e \u003cp\u003eWe also assessed the precision of atomic positions in the computational models using t-tests and normal probability plots. We found that their precision is substantially lower: their positional standard errors are approximately 5 times larger than those of the experimental data.\u003c/p\u003e \u003cp\u003eCurrently, high-resolution crystal structures offer more precise and dependable insights into protein stereochemistry. They should be prioritized over AI-generated computational models in scenarios demanding utmost accuracy, such as drug design or the analysis of reaction mechanisms.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eData selection\u003c/h2\u003e \u003cp\u003eAll protein structures were downloaded from the Protein Data Bank [\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e][\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e][\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e], according to the following procedure. Only crystal structures determined in the 90\u0026ndash;110 K temperature range and refined at resolution better than 1 \u0026Aring; were retained. 
Structures containing nucleic acids or too many non-aqueous heteroatoms (more than 5% of the protein atoms) were discarded, since these structures might be influenced by factors other than the protein amino acid sequence. Multi-model refinements were also excluded, as were structures with too large an average B-factor (\u0026gt;\u0026thinsp;25 \u0026Aring;\u003csup\u003e2\u003c/sup\u003e) according to reference [\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e]. To ensure data quality, structures with missing residues were also excluded [\u003cspan citationid=\"CR24\" class=\"CitationRef\"\u003e24\u003c/span\u003e]. All the retained structures are monomeric, with a single chain present in the asymmetric unit.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eEstimated standard errors of atomic coordinates in crystal structures\u003c/h2\u003e \u003cp\u003eThe estimated standard errors associated with the coordinates of the i\u003csup\u003eth\u003c/sup\u003e atom (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ese}_{coordinate,i}\\)\u003c/span\u003e\u003c/span\u003e) were computed as\u003cdiv id=\"Equ1\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ1\" name=\"EquationSource\"\u003e\n$${ese}_{coordinate,i}=DPI\\bullet \\sqrt{\\frac{{B}_{i}}{{B}_{ave}}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e1\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eaccording to reference [\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e], where \u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is the B-factor of the i\u003csup\u003eth\u003c/sup\u003e atom, \u003cem\u003eB\u003c/em\u003e\u003csub\u003e\u003cem\u003eave\u003c/em\u003e\u003c/sub\u003e is the average B-factor of the protein, and \u003cem\u003eDPI\u003c/em\u003e is the 
Diffraction Precision Index, which estimates the average standard error of protein atoms\u0026rsquo; positions and was computed according to reference [\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e] as\u003cdiv id=\"Equ2\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ2\" name=\"EquationSource\"\u003e\n$$DPI={\\left(\\frac{N}{p}\\right)}^{1/2}\\bullet {C}^{-1/3}\\bullet R\\bullet res$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e2\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eN\u003c/em\u003e is the number of atoms that were refined, \u003cem\u003ep\u003c/em\u003e is the difference between the number of observations and the number of refined parameters, \u003cem\u003eC\u003c/em\u003e is the fractional completeness of the data, \u003cem\u003eR\u003c/em\u003e is the R-factor, and \u003cem\u003eres\u003c/em\u003e is the crystallographic resolution.\u003c/p\u003e \u003cp\u003eBoth \u003cem\u003eDPI\u003c/em\u003e and coordinate estimated standard errors were computed with the server described in reference [\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eIt must be observed that in several instances it was not possible to get reliable DPI values since the percentage of fully occupied atoms is less than 90%. 
Attention was therefore focused on the following PDB entries: 1b0y, 1oew, 2pne, 3dha, 3wcq, 6cnw, 7vdn, 1ixh, 1xmk, 3agn, 3e4g, 4a02, 6fo5, 1mc2, 2fma, 3akq, 3hgp, 5nfm, 6zm8.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003eEstimated standard errors of interatomic distances in crystal structures\u003c/h2\u003e \u003cp\u003eThe estimated standard errors associated with interatomic distances between the i\u003csup\u003eth\u003c/sup\u003e and j\u003csup\u003eth\u003c/sup\u003e atoms were computed as\u003cdiv id=\"Equ3\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ3\" name=\"EquationSource\"\u003e\n$${ese}_{contact}=\\sqrt{{ese}_{coordinate,i}^{2}+{ese}_{coordinate,j}^{2}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e3\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eaccording to reference [\u003cspan citationid=\"CR25\" class=\"CitationRef\"\u003e25\u003c/span\u003e].\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003eComputational modelling\u003c/h2\u003e \u003cp\u003eAll computational models were built with ColabFold [\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e] by using AlphaFold2 [\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e] for structure prediction with three prediction cycles. ColabFold is much faster than AlphaFold2 because it uses a faster homology search technique (MMseqs2) [\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e], but is otherwise analogous to AlphaFold2. The use of templates from the Protein Data Bank was allowed and Amber energy minimization was applied. 
The top-ranked model was retained for the comparison with the experimental structure.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003eComparison between experimental and computed N-O contacts\u003c/h2\u003e \u003cp\u003eThe experimental N-O distances (\u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e) observed in the crystal structures and their counterparts (\u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003emod\u003c/em\u003e\u003c/sub\u003e) observed in the computational models can be compared with a \u003cem\u003et\u003c/em\u003e-test\u003cdiv id=\"Equ4\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ4\" name=\"EquationSource\"\u003e\n$$t=\\frac{\\left|{d}_{exp}-{d}_{mod}\\right|}{{ese}_{contact,exp}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e4\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,exp\u003c/em\u003e\u003c/sub\u003e is estimated standard error of \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e computed with Eq.\u0026nbsp;(\u003cspan refid=\"Equ3\" class=\"InternalRef\"\u003e3\u003c/span\u003e). Experimental and computed distances were considered significantly different if t\u0026thinsp;\u0026gt;\u0026thinsp;2.576 (0.99 probability level [\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e]). Only contacts with \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e \u0026lt; 4 \u0026Aring; were considered, though very similar results were obtained by lowering this threshold to 3.5 \u0026Aring; or by increasing it to 4.5 \u0026Aring;. 
In case of conformational disorder, when an atom has two or more equilibrium positions, only the first was considered and the others were discarded.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eEstimated standard errors of atomic coordinates in computational models\u003c/h2\u003e \u003cp\u003eThe experimental N-O distance (\u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e) observed in the crystal structure and its counterpart (\u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003emod\u003c/em\u003e\u003c/sub\u003e) observed in the computational model are statistically identical (0.99 probability level; [\u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e]) if\u003cdiv id=\"Equ5\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ5\" name=\"EquationSource\"\u003e\n$$\\frac{\\left|{d}_{exp}-{d}_{mod}\\right|}{\\sqrt{{ese}_{contact,exp}^{2}+{ese}_{contact,mod}^{2}}}\u0026lt;2.576$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e5\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,exp\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,mod\u003c/em\u003e\u003c/sub\u003e are the estimated standard errors associated with \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003emod\u003c/em\u003e\u003c/sub\u003e. 
Consequently, since 2.576\u003csup\u003e2\u003c/sup\u003e ≈ 6.636, it is possible to compute the minimal \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,mod\u003c/em\u003e\u003c/sub\u003e value that makes \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003emod\u003c/em\u003e\u003c/sub\u003e statistically equal\u003cdiv id=\"Equ6\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ6\" name=\"EquationSource\"\u003e\n$$\\sqrt{\\frac{{\\left({d}_{exp}-{d}_{mod}\\right)}^{2}-6.636\\bullet {ese}_{contact,exp}^{2}}{6.636}}\u0026lt;{ese}_{contact,mod}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e6\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhen\u003cdiv id=\"Equ7\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ7\" name=\"EquationSource\"\u003e\n$$\\left[{\\left({d}_{exp}-{d}_{mod}\\right)}^{2}-6.636\\bullet {ese}_{contact,exp}^{2}\\right]\u0026gt;0$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e7\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ei.e. when \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,exp\u003c/em\u003e\u003c/sub\u003e is not large enough to make \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003eexp\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003emod\u003c/em\u003e\u003c/sub\u003e statistically equal regardless of \u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003econtact,mod\u003c/em\u003e\u003c/sub\u003e. 
Then, by assuming that nitrogen and oxygen have the same coordinate estimated standard error in the computational model, it is possible to estimate it (\u003cem\u003eese\u003c/em\u003e\u003csub\u003e\u003cem\u003ecoordinate,mod\u003c/em\u003e\u003c/sub\u003e)\u003cdiv id=\"Equ8\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ8\" name=\"EquationSource\"\u003e\n$${ese}_{coordinate,mod}=\\frac{{ese}_{contact,mod}}{\\sqrt{2}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e8\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003eEstimated standard errors through normal probability plots\u003c/h2\u003e \u003cp\u003eA different approach for estimating the accuracy of the N-O distances in the computational models is provided by normal probability plots (Npp) [\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e][\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e].\u003c/p\u003e \u003cp\u003eNpps have been designed to compare two sets of experimental data (\u003cb\u003eX\u003c/b\u003e and \u003cb\u003eY\u003c/b\u003e), each characterized by \u003cem\u003en\u003c/em\u003e variables. 
The difference \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e between the two i\u003csup\u003eth\u003c/sup\u003e (1\u0026thinsp;\u0026le;\u0026thinsp;i\u0026thinsp;\u0026le;\u0026thinsp;\u003cem\u003en\u003c/em\u003e) variables \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e is calculated as\u003cdiv id=\"Equ9\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ9\" name=\"EquationSource\"\u003e\n$${d}_{i}=\\frac{\\left({x}_{i}-{y}_{i}\\right)}{\\sqrt{{sx}_{i}^{2}+{sy}_{i}^{2}}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e9\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewhere \u003cem\u003esx\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003esy\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e are the standard errors of \u003cem\u003ex\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003ey\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e, respectively.\u003c/p\u003e \u003cp\u003eThe \u003cem\u003en\u003c/em\u003e observed \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e values, sorted in order of increasing amplitude, are then plotted against the expected \u003cem\u003ede\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e values, which can be computed as follows\u003cdiv id=\"Equ10\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ10\" name=\"EquationSource\"\u003e\n$${de}_{i}=\\left|\\frac{n-2i+1}{n}\\right|$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e10\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003ewith negative sign when 
i\u0026thinsp;\u0026lt;\u0026thinsp;\u003cem\u003en\u003c/em\u003e/2, or positive, when i\u0026thinsp;\u0026gt;\u0026thinsp;\u003cem\u003en\u003c/em\u003e/2.\u003c/p\u003e \u003cp\u003eIf a regression line with unit slope and zero intercept fits the \u003cem\u003en\u003c/em\u003e data points, then \u003cb\u003eX\u003c/b\u003e\u0026thinsp;=\u0026thinsp;\u003cb\u003eY\u003c/b\u003e. In the absence of this, it can be assumed that either \u003cb\u003eX\u003c/b\u003e\u0026thinsp;\u0026ne;\u0026thinsp;\u003cb\u003eY\u003c/b\u003e or that the standard errors \u003cem\u003esx\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and \u003cem\u003esy\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e are underestimated.\u003c/p\u003e \u003cp\u003eThe second supposition holds that standard errors of the data can be estimated using Npps: the observed differences \u003cem\u003ed\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e are plotted against their predicted values \u003cem\u003ede\u003c/em\u003e\u003csub\u003e\u003cem\u003ei\u003c/em\u003e\u003c/sub\u003e and the slope \u003cem\u003ea\u003c/em\u003e of the regression line (\u003cem\u003ed\u003c/em\u003e\u0026thinsp;=\u0026thinsp;\u003cem\u003ea de\u003c/em\u003e) is used to calculate the average standard error \u003cem\u003esigma\u003c/em\u003e as\u003cdiv id=\"Equ11\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ11\" name=\"EquationSource\"\u003e\n$$sigma=a/\\sqrt{2}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e11\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003cp\u003eHere \u003cb\u003eX\u003c/b\u003e and \u003cb\u003eY\u003c/b\u003e are the N-O distances in the crystal structure and in computational models and \u003cem\u003esigma\u003c/em\u003e is the average positional estimated standard error, shared by the crystal structures and the computational models. 
Obviously, if the average positional estimated standard error of the crystal structures is known, it can be subtracted and it is thus possible to obtain an estimate of the average positional estimated standard error in the computational models as\u003cdiv id=\"Equ12\" class=\"Equation\"\u003e\u003cdiv format=\"TEX\" class=\"mathdisplay\" id=\"FileID_Equ12\" name=\"EquationSource\"\u003e\n$${ese}_{coordinate,mod}=\\sqrt{{sigma}^{2}-{ese}_{coordinate,exp}^{2}}$$\u003c/div\u003e\u003cdiv class=\"EquationNumber\"\u003e12\u003c/div\u003e\u003c/div\u003e\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eMiscellaneous\u003c/h3\u003e\n\u003cp\u003eSecondary structures were assigned with Stride [\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e] and solvent accessibility of atoms and residues was measured with Naccess [\u003cspan citationid=\"CR32\" class=\"CitationRef\"\u003e32\u003c/span\u003e]. Secondary structures were reduced to three states: helices (α, π, and 3\u003csub\u003e10\u003c/sub\u003e helices associated with the labels H, I, and G in Stride), extended (β-strands and bridges associated with the labels E, B, and b), and loops (all the rest).\u003c/p\u003e"},{"header":"Results and Discussion","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\n\u003ch2\u003eModels\u0026rsquo; quality and accuracy\u003c/h2\u003e\n\u003cp\u003eAll models were built with ColabFold [\u003cspan class=\"CitationRef\"\u003e4\u003c/span\u003e] as described in the Methods. This software, like AlphaFold2 [\u003cspan class=\"CitationRef\"\u003e2\u003c/span\u003e], computes a variable called pLDDT for each residue that represents the expected trustworthiness of the prediction. 
pLDDT values range from 0 to 100, and if pLDDT\u0026thinsp;\u0026gt;\u0026thinsp;90, the positions of both main-chain and side-chain atoms are considered to be reliably predicted; if 70\u0026thinsp;\u0026lt;\u0026thinsp;pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;90, errors are mostly possible in side-chain atoms; if 50\u0026thinsp;\u0026lt;\u0026thinsp;pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;70, errors are also possible in main-chain atoms; and if pLDDT\u0026thinsp;\u0026lt;\u0026thinsp;50, predictions are essentially unreliable (in other words, prediction is impossible).\u003c/p\u003e\n\u003cp\u003eHighly reliable models could be built, with only one exception. The average pLDDT value is very close to or larger than 90 for nearly all models (Fig.\u0026nbsp;1a) and the distribution of the pLDDT values shows that they are very seldom smaller than 90 despite a small peak close to 40 (Fig.\u0026nbsp;1b).\u003c/p\u003e\n\u003cp\u003eOnly one of the models is unreliable. Its average pLDDT is very small (ca. 39) and well below the threshold value (50) under which predictions are considered to be impossible. Not surprisingly, in this case, the difference between the computational model and the crystal structure is enormous, too (rmsd\u0026thinsp;=\u0026thinsp;18 \u0026Aring;).\u003c/p\u003e\n\u003cp\u003eThis is an antifreeze protein from \u003cem\u003eHypogastrura harveyi\u003c/em\u003e (2pne) that folds into a compact structure similar to a cuboid parallelepiped, the two largest and opposite faces of which are formed by three antiparallel left-handed polyproline II helices. There are many hydrogen bonds (342), more than 4 per residue. 
We are not aware of any research on the capacity of AlphaFold2 or ColabFold to predict polyproline helices; nevertheless, this could be a shortcoming of both AI-based computational approaches, and the incorrect AI-based modelling of an antifreeze protein fold has already been observed [\u003cspan class=\"CitationRef\"\u003e33\u003c/span\u003e].\u003c/p\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\n\u003ch2\u003eComparison between N-O contacts\u003c/h2\u003e\n\u003cp\u003eThe contacts between a nitrogen and an oxygen atom of the main-chain shorter than 4 \u0026Aring; were considered. As stated in the Methods, findings were remarkably comparable whether this threshold was raised or lowered by 0.5 \u0026Aring;.\u003c/p\u003e\n\u003cp\u003eAccording to Eq.\u0026nbsp;(\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e), t-values were calculated for each pair of experimental and computational contacts, and if t\u0026thinsp;\u0026gt;\u0026thinsp;2.576 (0.99 probability threshold), the experimental and computational distances were deemed to be different.\u003c/p\u003e\n\u003cp\u003eEach nitrogen-oxygen contact was assigned the average pLDDT of its two atoms.\u003c/p\u003e\n\u003cp\u003eThe average t-values estimated at various pLDDT levels are much bigger than 2.576, as shown in Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e, and, as expected, they increase as pLDDT decreases.\u003c/p\u003e\n\u003cp\u003eFurthermore, a considerable percentage of the experimental N-O contacts differ statistically from their computationally modelled counterparts and, as expected, as pLDDT decreases, the percentage of cases where the difference is statistically significant rises.\u003c/p\u003e\n\u003cp\u003eSecondary structure and solvent accessibility were also used to classify N-O contacts. 
The fraction of experimental N-O contacts that vary significantly from their computationally simulated equivalents varies marginally with secondary structure and is mostly independent of the residues' relative solvent exposed surface area. On the contrary, when the atomic solvent accessible surface area of the atoms grows, this proportion falls (Table\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e2\u003c/span\u003e). This implies that the experimental and modelled distances tend to statistically equalize for atoms that are more exposed to the solvent, which may simply be due to the fact that these atoms are more exposed and hence have greater B-factors and, as a result, larger positional standard errors (Eq.\u0026nbsp;\u003cspan class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e\n\u003cp\u003eThe crystal structures considered in the present study are the most accurate protein crystal structures determined to date. In addition to the extraordinarily high resolution that is comparable to that achievable in small asymmetric unit crystallography, the absence of structurally undetermined protein sections makes these crystal structures the \u003cem\u003ecr\u0026egrave;me de la cr\u0026egrave;me\u003c/em\u003e of the protein structural data currently available.\u003c/p\u003e\n\u003cp\u003eAlthough, as illustrated above, the computational models are certainly of good quality, they appear to be less accurate than these top-quality crystal structures. 
If a chemical understanding of the structure is required, with an accurate description of the stereochemistry and of the energy associated with it, achieving experimental high resolution is still preferable to employing computer models generated with artificial intelligence.\u003c/p\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"char\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"char\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"char\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tab1\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003eAnalyses of t-values computed with Eq.\u0026nbsp;(\u003cspan class=\"InternalRef\"\u003e4\u003c/span\u003e). Number of observations (obs), average value of t, with standard deviation in parentheses, (ave and std, respectively) and percentage of times t\u0026thinsp;\u0026gt;\u0026thinsp;2.576.\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003e\u003cem\u003epLDDT range\u003c/em\u003e\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003e\u003cem\u003eobs\u003c/em\u003e\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003e\u003cem\u003eave (std)\u003c/em\u003e\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003e\u003cem\u003epc\u003c/em\u003e\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003eany\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e2415\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e12.1 
(1.0)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e40.2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003epLDDT\u0026thinsp;\u0026gt;\u0026thinsp;90\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e2224\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e4.9 (0.3)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e39.4\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e70\u0026thinsp;\u0026lt;\u0026thinsp;pLDDT\u0026thinsp;\u0026le;\u0026thinsp;90\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e55\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e23.1 (7.8)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e39.6\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e50\u0026thinsp;\u0026lt;\u0026thinsp;pLDDT\u0026thinsp;\u0026le;\u0026thinsp;70\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e11\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e44.2 (17.2)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e57.9\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003epLDDT\u0026thinsp;\u0026le;\u0026thinsp;50\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e125\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e198.6 (22.8)\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" 
char=\".\"\u003e\n\u003cp\u003e60.4\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003cdiv class=\"gridtable\"\u003e\n\u003cdiv class=\"colspec\" align=\"left\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003cdiv class=\"colspec\" align=\"char\"\u003e\u0026nbsp;\u003c/div\u003e\n\u003ctable id=\"Tab2\" border=\"1\"\u003e\u003ccaption\u003e\n\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\n\u003cdiv class=\"CaptionContent\"\u003e\n\u003cp\u003ePercentages (pc) of N-O contacts where the experimental and the modelled distances are significantly different at increasing values of the average solvent accessible surface area (SASA, \u0026Aring;\u003csup\u003e2\u003c/sup\u003e) of the nitrogen and oxygen atoms.\u003c/p\u003e\n\u003c/div\u003e\n\u003c/caption\u003e\n\u003cthead\u003e\n\u003ctr\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003eSASA (\u0026Aring;\u003csup\u003e2\u003c/sup\u003e)\u003c/p\u003e\n\u003c/th\u003e\n\u003cth align=\"left\"\u003e\n\u003cp\u003epc\u003c/p\u003e\n\u003c/th\u003e\n\u003c/tr\u003e\n\u003c/thead\u003e\n\u003ctbody\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e0\u0026ndash;1\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e43.7\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e1\u0026ndash;2\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e35.7\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e2\u0026ndash;3\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e32.3\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e3\u0026ndash;4\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e33.5\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd 
align=\"left\"\u003e\n\u003cp\u003e4\u0026ndash;5\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e33.1\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e5\u0026ndash;6\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e29.5\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e6\u0026ndash;7\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e32.2\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e7\u0026ndash;8\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e34.7\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd align=\"left\"\u003e\n\u003cp\u003e8\u0026ndash;9\u003c/p\u003e\n\u003c/td\u003e\n\u003ctd align=\"char\" char=\".\"\u003e\n\u003cp\u003e26.5\u003c/p\u003e\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/tbody\u003e\n\u003c/table\u003e\n\u003c/div\u003e\n\u003c/div\u003e\n\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\n\u003ch2\u003eEstimated standard errors of the atomic positions in computational models\u003c/h2\u003e\n\u003cp\u003eEquations\u0026nbsp;(\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan class=\"InternalRef\"\u003e8\u003c/span\u003e) allow the determination of the estimated standard error of the atomic position in computational models (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ese}_{coordinate,mod}\\)\u003c/span\u003e\u003c/span\u003e). 
This is the smallest coordinate error required for the experimental and computed N-O distances to be statistically indistinguishable.\u003c/p\u003e\n\u003cp\u003eThis analysis was performed only on two categories of N-O contacts, those with average pLDDT\u0026thinsp;\u0026gt;\u0026thinsp;90 and those with 70\u0026thinsp;\u0026lt;\u0026thinsp;pLDDT\u0026thinsp;\u0026le;\u0026thinsp;90. Contacts with lower pLDDT values were excluded for three reasons: first, there are few observations of N-O contacts with pLDDT\u0026thinsp;\u0026le;\u0026thinsp;70; second, these predictions have low confidence and are therefore of little interest; third, nearly all the observations with pLDDT\u0026thinsp;\u0026le;\u0026thinsp;50 are concentrated in a single structure (2pne) that could not be modelled, as described above.\u003c/p\u003e\n\u003cp\u003eThe \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ese}_{coordinate,mod}\\)\u003c/span\u003e\u003c/span\u003e were computed in this way for each N-O contact, by assuming that the nitrogen and the oxygen atom have the same \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ese}_{coordinate,mod}\\)\u003c/span\u003e\u003c/span\u003e, and then averaged. As expected, their values, shown in \u003cstrong\u003eFig.\u0026nbsp;2\u003c/strong\u003e (upper part), increase as pLDDT decreases. In any case, they are much larger than the estimated standard errors of the coordinates in the crystal structures, which have an average value of 0.0149 (\u0026plusmn;\u0026thinsp;0.0001) \u0026Aring;. 
This is also evident in the distributions of the estimated standard errors for the experimental and modelled structures (Fig.\u0026nbsp;2, lower part): the experimental values are almost always smaller than 0.03\u0026nbsp;\u0026Aring;, whereas those of the models can be much larger.\u003c/p\u003e\n\u003cp\u003eSimilarly, it is possible to compute the \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\({ese}_{coordinate,mod}\\)\u003c/span\u003e\u003c/span\u003e through normal probability plots (Npp) as described in the Methods. In this case, all N-O distances observed in the crystal structures are compared to their counterparts in the computational models to get an average estimated standard error, the value of which is reported in \u003cstrong\u003eFig.\u0026nbsp;2\u003c/strong\u003e (upper part). For the N-O contacts that have an average pLDDT larger than 90, the Npp estimated standard error is slightly larger than that computed with equations (\u003cspan class=\"InternalRef\"\u003e5\u003c/span\u003e\u0026ndash;\u003cspan class=\"InternalRef\"\u003e8\u003c/span\u003e). Conversely, it is slightly smaller for the N-O contacts in the 70\u0026ndash;90 pLDDT range. 
Both these values remain, however, much larger than the estimated standard errors of the crystal structures.\u003c/p\u003e\n\u003cp\u003eOverall, the estimated standard errors of the atomic positions in the models built with ColabFold are 3.5\u0026ndash;6 times larger than the estimated standard errors of the atomic positions in the crystal structures examined in the present study.\u003c/p\u003e\n\u003c/div\u003e"},{"header":"Conclusions","content":"\u003cp\u003eShort backbone N-O distances found in high-resolution crystal structures were compared with those found in three-dimensional models created using techniques based on artificial intelligence (AI) AlphaFold/ColabFold, by explicitly taking into account their estimated standard errors.\u003c/p\u003e \u003cp\u003eIt has been observed that experimental and computationally modeled distances often differ significantly, indicating that these models' accuracy is insufficient to replicate experimental findings.\u003c/p\u003e \u003cp\u003eIt was possible to determine, by using t-tests and normal probability plots, that the estimated standard errors of the atomic positions predicted by these computational techniques are 3.5\u0026ndash;6 times greater than the experimental errors.\u003c/p\u003e"},{"header":"Declarations","content":"\u003col style=\"list-style-type: lower-alpha;\"\u003e\n\u003cli\u003eEthics approval and consent to participate: Not applicable.\u003c/li\u003e\n\u003cli\u003eConsent for publication: Not applicable.\u003c/li\u003e\n\u003cli\u003eAvailability of data and materials: All data are taken from the Protein Data Bank; further details are available on request.\u003c/li\u003e\n\u003cli\u003eCompeting interests: No competing interests.\u003c/li\u003e\n\u003cli\u003eFunding: Not applicable.\u003c/li\u003e\n\u003cli\u003eAuthors' contributions: OC, who is the sole author, designed and performed all the analyses and wrote the manuscript.\u003c/li\u003e\n\u003cli\u003eAcknowledgments: A. 
Corelli is gratefully acknowledged for constant support and K. Djinović for helpful discussion. Dr. K. Sekar (Indian Institute of Science, Bangalore) is gratefully acknowledged for his help in computing positional standard errors in protein crystal structures. The author acknowledges support from the Ministero dell\u0026rsquo;Universit\u0026agrave; e della Ricerca (MUR) and the University of Pavia through the program \u0026ldquo;Dipartimenti di Eccellenza 2023\u0026ndash;2027\u0026rdquo;.\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eCarugo O, Djinović-Carugo K. Structural biology: A golden era. PLoS Biol. 2023;21:e3002187.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596:584\u0026ndash;9.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBaek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science. 2021;373:871\u0026ndash;6.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirdita M, Sch\u0026uuml;tze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: making protein folding accessible to all. Nat Methods. 2022;19:679\u0026ndash;82.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123\u0026ndash;30.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBernstein FC, Koetzle TF, Williams GJB, Meyer EFJ, Brice MD, Rodgers JR, et al. The Protein Data Bank: a computer-based archival file for macromolecular structures. J Mol Biol. 
1977;112:535\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBerman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, et al. The Protein Data Bank. Nucleic Acids Res. 2000;28:235\u0026ndash;42. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve\u003c/span\u003e\u003cspan address=\"http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u0026amp;db=PubMed\u0026amp;dopt=Citation\u0026amp;list_uids=10592235.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ewwPDB Consortium. Protein Data Bank: The single global archive for 3D macromolecular structural data. Nucleic Acids Res. 2019;47:D520\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eVaradi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 2022;50:D439\u0026ndash;44.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eEvans R, O\u0026rsquo;Neill M, Pritzel A, Antropova N, Senior A, Green T, et al. Protein complex prediction with AlphaFold-Multimer. bioRxiv. 2022\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e2021.10.04.463034. doi:10.1101/2021.10.04.463034\u003c/span\u003e\u003cspan address=\"2021.10.04.463034. doi:10.1101/2021.10.04.463034\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eZhu W, Shenoy A, Kundrotas P, Elofsson A. Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics. 2023;39.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eYin R, Feng BY, Varshney A, Pierce BG. 
Benchmarking AlphaFold for protein complex modeling reveals accuracy determinants. Protein Sci. 2022;31:e4379.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eTramontano A. Protein Structure Prediction: Concepts and Applications. New York: John Wiley \u0026amp; Sons; 2006.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBinder JL, Berendzen J, Stevens AO, He Y, Wang J, Dokholyan NV, et al. AlphaFold illuminates half of the dark human proteins. Curr Opin Struct Biol. 2022;74:102372.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStevens AO, He Y. Benchmarking the Accuracy of AlphaFold 2 in Loop Structure Prediction. Biomolecules. 2022;12:985.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBuel GR, Walters KJ. Can AlphaFold2 predict the impact of missense mutations on structure? Nat Struct Mol Biol. 2022;29:1\u0026ndash;2.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarugo O, Djinovic-Carugo K. Automated identification of chalcogen bonds in AlphaFold protein structure database files: is it possible? Front Mol Biosci. 2023;10:1155629.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMoore PB, Hendrickson WA, Henderson R, Brunger AT. The protein-folding problem: Not yet solved. Science. 2022;375:507.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCruickshank DWJ. Remarks about protein structure precision. Acta Cryst. 1999;D55:583\u0026ndash;93.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBlow DM. Rearrangement of Cruickshank\u0026rsquo;s formulae for the diffraction-component precision index. Acta Cryst. 2002;D58:792\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGurusaran M, Shankar M, Nagarajan R, Helliwell JR, Sekar K. Do we see what we should see? Describing non-covalent interactions in protein structures including precision. IUCrJ. 
2014;1:74\u0026ndash;81.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDinesh Kumar KS, Gurusaran M, Satheesh SN, Radha P, Pavithra S, Thulaa Tharshan KPS, et al. Online_DPI: a web server to calculate the diffraction precision index for a protein structure. J Appl Cryst. 2015;48:939\u0026ndash;42.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCarugo O. How large B-factors can be in protein crystal structures. BMC Bioinformatics. 2018;19:61. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12859-018-2083-8\u003c/span\u003e\u003cspan address=\"10.1186/s12859-018-2083-8\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDjinovic Carugo K, Carugo O. Missing strings of residues in protein crystal structures. Intrinsically Disord Proteins. 2015;3:1\u0026ndash;7.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eGiacovazzo C, Monaco HL, Artioli G, Viterbo D, Ferraris G, Gilli G, et al. Fundamentals of Crystallography. Oxford: Oxford University Press; 2002.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMirdita M, Steinegger M, S\u0026ouml;ding J. MMseqs2 desktop and local web server app for fast, interactive sequence searches. Bioinformatics. 2019;35:2856\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDowdy S, Wearden S, Chilko D. Statistics for research. Hoboken: John Wiley \u0026amp; Sons; 2004.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCruickshank DWJ, Robertson AP. The comparison of theoretical and experimental determinations of molecular structures, with applications to naphthalene and anthracene. Acta Cryst. 1953;6:698\u0026ndash;705.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAbrahams SC, Keve ET. Normal probability plot analysis of error in measured and derived quantities and standard deviations. 
Acta Cryst. 1971;A27:157\u0026ndash;61.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHamilton WC, Abrahams SC. Normal probability plot analysis of small samples. Acta Cryst. 1972;A28:215\u0026ndash;8.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHeinig M, Frishman D. STRIDE: A web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res. 2004;32:W500\u0026ndash;2.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHubbard SJ, Thornton JM. NACCESS. Department of Biochemistry and Molecular Biology, University College London. 1993.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLaurents DV. AlphaFold 2 and NMR Spectroscopy: Partners to Understand Protein Structure, Dynamics and Function. Front Mol Biosci. 2022;9:906437.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Accuracy, Artificial intelligence, Estimated standard error, Protein Data Bank, Protein structure prediction","lastPublishedDoi":"10.21203/rs.3.rs-3821040/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-3821040/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eArtificial intelligence (AI) has revolutionized structural biology by predicting protein 3D structures with near-experimental accuracy. Here, short backbone N-O distances in high-resolution crystal structures were compared to those in three-dimensional models based on AI AlphaFold/ColabFold, specifically considering their estimated standard errors. Experimental and computationally modeled distances very often differ significantly, showing that these models' precision is inadequate to reproduce experimental results at high resolution. T-tests and normal probability plots showed that these computational methods predict atomic position standard errors 3.5–6 times bigger than experimental errors.\u003c/p\u003e","manuscriptTitle":"Accuracy of AlphaFold models: Comparison with short N ... O contacts in atomic resolution protein crystal structures","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2024-01-03 06:50:26","doi":"10.21203/rs.3.rs-3821040/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"19ec768a-4841-42d2-af30-81a521776186","owner":[],"postedDate":"January 3rd, 2024","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2024-04-06T14:48:50+00:00","versionOfRecord":{"articleIdentity":"rs-3821040","link":"https://doi.org/10.1016/j.compbiolchem.2024.108069","journal":{"identity":"computational-biology-and-chemistry","isVorOnly":true,"title":"Computational Biology and Chemistry"},"publishedOn":"2024-04-01 14:48:50","publishedOnDateReadable":"April 1st, 2024"},"versionCreatedAt":"2024-01-03 06:50:26","video":"","vorDoi":"10.1016/j.compbiolchem.2024.108069","vorDoiUrl":"https://doi.org/10.1016/j.compbiolchem.2024.108069","workflowStages":[]},"version":"v1","identity":"rs-3821040","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-3821040","identity":"rs-3821040","version":["v1"]},"buildId":"qtupq5eGEP_6zYnWcrvyt","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}