Practical outcomes from CASP16 for users in need of biomolecular structure prediction

Proteins: Structure, Function, and Bioinformatics. This is a preprint and has not been peer reviewed. Data may be preliminary. 27 June 2025, V1. https://doi.org/10.22541/au.175102407.70404028/v1

Running title: Practical outcomes from CASP16

Luciano A. Abriata* (ORCID 0000-0003-3087-8677) and Matteo Dal Peraro

Laboratory for Biomolecular Modeling and Protein Structure Core Facility, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL) and Swiss Institute of Bioinformatics, CH-1015 Lausanne, Switzerland

* Corresponding author

Abstract

The 16th Critical Assessment of Structure Prediction benchmarked advancements in biomolecular modeling, particularly in the context of AlphaFold 2 and 3 systems. Protein monomer and domain prediction is largely solved, with little room for further improvement at the backbone level, although modeling local details, irregular regions, and mutational effects remains challenging. For protein assemblies, AF-based methods, especially when expertly guided or enhanced by servers like those from the Yang, Zheng/Zhang, and Cheng labs, show progress, though complex topologies and antibody-antigen interactions (where specialized docking approaches showed promise) are still difficult. Notably, a priori knowledge of stoichiometry significantly aids assembly prediction. Protein-ligand co-folding with AF3 demonstrated strong potential for pose prediction, outperforming many participants and some dedicated docking tools in baseline tests, but ligand affinity prediction is currently totally unreliable. Nucleic acid structure prediction lags considerably, relying heavily on 3D templates and expert human intervention, with AF3 showing notable limitations. Overall, AF3’s modeling capabilities are at or close to the state of the art on all fronts; additionally, it shows slight improvements over AF2 and more detailed confidence metrics. This article guides users on tool selection, realistic accuracy expectations, and persistent challenges, emphasizing the critical role of confidence metrics in interpreting AI-generated models.

1. Introduction: context and some important definitions

For over 30 years, the Critical Assessment of protein Structure Prediction (CASP) experiment has served as a crucial community-wide benchmark for the field of biomolecular structure prediction. [1] By challenging predictors to model the 3D structures of “targets” (proteins and, more recently, other biomolecules and their complexes) secured before their experimental structures are publicly released, CASP provides an objective assessment of the state of the art.
Each CASP edition has chronicled the evolution of prediction methodologies, from early homology modeling, through the era of co-evolutionary contact prediction that took off around CASP11-CASP12, [2,3] to the deep learning-driven breakthroughs. Notably, CASP13 [4] (2018) saw the emergence of DeepMind’s AlphaFold 1, [5] which showcased the potential of AI, while a revolution arrived in CASP14 (2020) with AlphaFold 2 [6,7] (AF2), which achieved near-experimental accuracy at the backbone level for most protein domains and even for some full proteins. [6] CASP15 then charted the widespread adoption of AF2-like systems by various predictors, who incorporated them into their own pipelines, and a growing focus on more complex targets, culminating with AlphaFold 3 (AF3) and other multimodal AI systems. [8–10] The AF2 breakthrough impacted biological research profoundly, powering it directly through its models (produced ad hoc or pre-computed in the EBI database [11,12]) and also by assisting experimental structural biology in novel ways, [13,14] having essentially solved the problem of modeling the 3D structures of domains and largely advanced that of modeling multiprotein complexes, in turn triggering whole new ways to model and design proteins. [15] CASP15 built upon these advances, finding that most predictors used AF2 in one way or another, often with refinements such as customized multiple sequence alignments (MSAs, essential for the program to discover contact patterns) and optimized template detection (because all these AI systems still perform better when a template is available). Additionally, CASP15 highlighted new frontiers such as multimeric assemblies and non-protein components, among which complexes involving nucleic acids and small molecules were the most important.
During CASP16 (2024), AF3, which can model a wider range of biomolecules (proteins, nucleic acids, ions and ligands, including protein post-translational modifications), became available [10], initially as a web server with various limitations and later (towards the end of the prediction season) as released source code that lifts several of those limitations. The main limitations when running AF3 through the server included daily usage quotas, a limited set of available ligands, and no possibility to tweak MSAs or provide specific templates. All this was certainly limiting for the predictors who may have wanted to harness AF3’s full power in their pipelines, and also for CASP’s own use of AF3 as a baseline. As of today, these limitations are still active on the server version; therefore, users should be aware that AF3’s full capabilities can only be exploited through an installed version (links to server and code in Table 1).

A last general note on CASP, very relevant to those wanting to make use of its output, is that the contest typically involves “human groups”, who apply manual intervention and their expert knowledge, and “server groups”, which act as fully automated methods, some of which become accessible online after CASP. Predictors submit up to five models per target, with their most confident prediction designated as “model 1”. Two important points then arise. First, the human groups typically perform as well as or better than servers, although the gap may be slowly closing. Second, all CASP editions have consistently found that predictors are bad at scoring their own or other predictors’ models, even when some of the models match the true structure perfectly.

In this article we distill the practical outcomes from CASP16, aiming at end-users seeking to apply biomolecular structure prediction software to their research.
We focus on identifying which methods and approaches performed best across different prediction categories, their general availability (e.g., web servers, standalone software), realistic expectations of accuracy, and current limitations, summarizing the comprehensive observations reported by the CASP16 assessors. [16–19] Very importantly, the assessors’ papers deal with all predictor groups, drawing only a partial distinction between those that stand as resources that end-users can actually use, those that involve closed pipelines still in development, and those that participated as human experts. Here we comment on all of them when appropriate, while carefully flagging (and focusing on) those that exist as actual tools available to users (with links in Table 1).

The critical role of accuracy estimates

Before we delve into the results, a note of high practical relevance. For a long time, CASP asked predictors to provide not only the 3D models themselves but also some kind of residue-level confidence metric encoded in the B-factor columns of the submitted PDB files. Since most predictors paid little or no attention to this, for a long time these metrics were not considered in the evaluation. However, when we noted in CASP13 that predictions were starting to be quite good, we did consider them in the qualitative part of the assessment, [4] and proposed that submitting per-residue confidences should become mandatory. Moreover, in that same paper we introduced the idea that it would also be good to ask predictors for residue-residue confidence metrics, i.e. values that predict how well each residue is modeled relative to all others. Our motivation was that some residues could turn out well modeled locally, say in the context of their sequence neighbors, but not in the global 3D context, in which the most valuable interpretations would be made (for example, to propose mutations that disrupt contacts).
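As a practical aside, a residue-level confidence stored in the B-factor column can be read back with a few lines of code. The sketch below is only an illustration under stated assumptions: it expects fixed-width PDB ATOM/HETATM records, the function names are hypothetical, the demo records are made up, and the cutoff of 70 is an arbitrary value chosen for the example rather than a recommendation from CASP.

```python
from collections import defaultdict


def per_residue_confidence(pdb_text):
    """Average the B-factor column over the atoms of each residue.

    Assumes fixed-width PDB ATOM/HETATM records, where the chain ID is
    column 22, the residue number is columns 23-26, and the B-factor
    (here holding the confidence value) is columns 61-66.
    """
    totals = defaultdict(lambda: [0.0, 0])  # (chain, resnum) -> [sum, count]
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue
        key = (line[21], int(line[22:26]))
        totals[key][0] += float(line[60:66])
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


def low_confidence_residues(confidence, cutoff):
    """Return residues whose mean confidence is below a user-chosen cutoff."""
    return sorted(key for key, value in confidence.items() if value < cutoff)


def atom_line(serial, name, resname, chain, resnum, bfactor):
    """Build one fixed-width ATOM record (dummy coordinates) for the demo."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f" % (
        serial, name, resname, chain, resnum, 0.0, 0.0, 0.0, 1.0, bfactor)


# Made-up three-atom example, not taken from any real model.
demo_pdb = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 90.0),
    atom_line(2, "CA", "MET", "A", 1, 80.0),
    atom_line(3, "CA", "GLY", "A", 2, 40.0),
])
demo_conf = per_residue_confidence(demo_pdb)
print(demo_conf)
print(low_confidence_residues(demo_conf, cutoff=70.0))
```

Averaging per residue mirrors the usual way such values are color-mapped onto a cartoon; models distributed in mmCIF format would need a different parser.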
In summary, the ideal structure prediction tool should produce not only a 3D model but also a global score, which most serious programs already provide as a predicted TM score (CASP uses GDT-TS rather than TM, but the two are well correlated), plus a residue-wise score (ideally stored in the B-factor column of the PDB file for rapid visualization), and a residue-residue score provided as a separate matrix. DeepMind fulfilled this in its AF2 system, whose models are accompanied by a global predicted TM score, a per-residue confidence score called pLDDT (predicted Local Distance Difference Test, as it attempts to predict the “local distance”), and a matrix with a residue-residue confidence metric called PAE (predicted aligned error), which attempts to predict what the error would be on a residue when another is aligned to its actual location in the true structure. While pLDDT allows one to rapidly tell whether a region is well modeled, and it can quickly reflect regions not covered smoothly in the input MSA as well as pinpoint flexible loops and disordered regions, [20,21] the PAE matrix is especially useful to estimate the quality of domain-domain arrangements in a protein and of protein-protein interfaces in complexes. AF3 also provides all these confidence metrics, including residue-level and atom-level pLDDT plus metrics dedicated to scoring interfaces. Unfortunately, the most used servers for structure prediction, including some ranking at the top in recent CASP editions, do not provide more than a global quality score and in some cases a residue-wise score. We exemplify the role of metrics in a multimodal prediction with AF3 in Figure 1, which also provides some hints and online tools to simplify the inspection of confidence metrics.

Figure 1. Example of assembly modeling with AlphaFold 3 and of its confidence metrics.
(A) Modeling a complex between a peripheral membrane protein with a palmitoylated cysteine (blue), a short integral membrane protein (orange), and 50 lipid molecules included for context (grey), which the program spontaneously assembled into a bilayer-like structure that reflects the true nature of this complex (Golph3-LCS from Theodoropoulos et al***REF). (B) Atom-wise pLDDT traces by chain (higher is better). For proteins and nucleic acids, pLDDT is most often averaged per residue and color-mapped onto a cartoon representation of the 3D model as shown in the inset (blue is low pLDDT, red is high pLDDT; the palmitoylated cysteine and the lipids are colored by pLDDT mapped at atomic level). (C) PAE plot quantifying how reliably each residue was modeled relative to all others in the model (lower is better). Molecular graphics in this figure were rendered with PyMOL 0.99 and the plots were generated from the raw AF3 server outputs with a custom tool available at https://go.epfl.ch/af3scores.

Into CASP16

Historically, the prediction of individual protein domain structures was the central challenge in CASP, but since CASP14, when domain modeling became so accurate, the focus has shifted towards whole proteins and the complexes they form with other proteins and with other kinds of molecules, while always touching on other questions including nucleic acids, conformations, integrative modeling, etc.
CASP16 (2024) in particular included nine broad modeling categories; they are explained in detail in another article of this issue [22] but roughly correspond to: protein structure, with separate assessment of monomers, multimers, the effect of (not) knowing stoichiometry, and the use of large precomputed model sets (from MassiveFold, see below); estimations of protein model accuracy (not covered in this article); targets consisting exclusively of RNA and DNA molecules, with separate assessments for monomeric nucleic acids and for multimers with and without stoichiometry information; hybrid complexes (assemblies containing one or more proteins plus RNA and/or DNA molecules), again with and without stoichiometry information; complexes between proteins and small-molecule ligands, with an interesting set of targets provided by pharmaceutical companies and also with regular ligands present in general CASP targets; and prediction of ligand binding affinities. Three additional small, rather anecdotal tests were conducted on predicting multiple conformations, the solvent spatial distribution around an RNA molecule, and the distribution of inter-domain orientations for two protein domains connected by a flexible linker (none covered here).

Naturally, given the relevance of AlphaFold-based approaches, a dominant theme across CASP16 was their central role in modeling proteins and their complexes, either directly or as part of other pipelines. This is why CASP adopted regular AF2 run through ColabFold [23,24] and AF3 run through the server as baselines, a test that showed upfront that naïve usage of AF3 provides close to state-of-the-art predictions (be they good or bad) along all tracks, as detailed (with some caveats analyzed) by Elofsson.
[25] Moreover, CASP16 utilized MassiveFold, [26] a parallelization engine that enables massive generation of structural diversity in AF2-based runs, to produce large numbers of AF2 models generated with different seeds, which were then made available to predictors. The goal was two-fold: first, to check whether better models than those from standard AF2 and ColabFold runs were produced; and second, to check whether predictors were capable of identifying such models. Additionally, the availability of large numbers of AF2 models via MassiveFold leveled the playing field for resource-limited groups and could hopefully stimulate developments in the crucial area of model quality assessment. Unfortunately, the main conclusion was that although MassiveFold does indeed produce better models, predictors are not very good at identifying them. [27]

When using AF2, predictors could tweak its execution by providing specific templates and MSAs, changing the numbers of iterations and starting seeds, and applying other more exotic modifications such as subsampling to obtain more structural variety (this is now also possible for AF3 if used on a local installation, but it wasn’t available during CASP16). Leading groups indeed employed sophisticated strategies involving MSA optimization (though assessors noted this was perhaps slightly less critical than in CASP15), careful template utilization, enhanced conformational sampling, and meticulous construct refinement. [28,29] ***Other predictor papers These strategies worked well in many cases, leading to somewhat better models, with a gain in model accuracy that was more marked for larger systems and assemblies, while at the domain level the effect was small, mainly because AF2 and AF3 models are already very good and there is little room for improvement.
In this context it is important to note that CASP rankings based on Z-scores amplify small differences, and that the absolute quality differences between top methods and baseline ColabFold or AF3, when compared by TM or GDT-TS scores, are smaller; they are marginal for well-behaved domains in particular.

2. Modeling of Protein Monomers: Domains and Full Proteins

2.1 State of the art and top performers in CASP16

CASP16 assessors for monomer prediction concluded that the problem of single-domain protein fold prediction is now nearly solved, with no target folds being missed across all defined evaluation units (often roughly corresponding to domains or sets of continuous domains). [19] This success is largely attributed to the continued dominance and refinement of AF2- and AF3-based pipelines. Servers from the Yang, Zheng/Zhang, and Cheng labs achieved top performance across monomer targets. As was general throughout CASP16 for groups capitalizing on AF2 and AF3, their strategies involved meticulous optimization of MSAs, careful selection of “constructs” (the specific protein sequence fragment used for modeling), and enhanced conformational sampling with AF2 and AF3. [28,29] As explained above, however, their raw scores on targets aren’t much higher than AF3’s. In turn, AF3, used as the web server available during the prediction period, demonstrated a small but noticeable advantage over AF2, particularly in its confidence estimation and model selection capabilities. Notably, when ranked by the quality of “model 1” submissions at the domain level, the AF3 server itself rose to second place, being essentially indistinguishable from other top groups, even human experts (see Figure 6A of the monomer assessment paper [19]). Note that many groups in CASP16 were able to outperform the widely used ColabFold implementation of AF2 used as one of the baselines, reflecting an increased community expertise in optimizing AlphaFold-based predictions and the growing adoption of AF3.
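The remark that Z-score rankings amplify small differences, while raw scores of top methods sit close to the baselines, can be illustrated with a toy calculation. This is a minimal sketch using a plain z-score; the numbers are hypothetical GDT-TS values invented for the example, and the actual CASP ranking procedure involves additional steps that are not reproduced here.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Convert raw per-group scores into plain z-scores (population SD)."""
    values = list(scores.values())
    mu, sd = mean(values), pstdev(values)
    if sd == 0:
        raise ValueError("all scores are identical; z-scores are undefined")
    return {group: (value - mu) / sd for group, value in scores.items()}


# Hypothetical GDT-TS values for one easy target; illustrative numbers only.
raw = {"group_a": 92.0, "group_b": 91.0, "group_c": 90.0, "baseline": 89.0}
z = z_scores(raw)

raw_gap = raw["group_a"] - raw["baseline"]  # difference in GDT-TS units
z_gap = z["group_a"] - z["baseline"]        # same difference in SD units
print(f"raw gap: {raw_gap:.1f} GDT-TS points; z-score gap: {z_gap:.2f} SD")
```

When all groups cluster tightly, as in this toy case, a gap of a few GDT-TS points spans several standard deviations, which is why a ranking can look decisive even when the models are nearly equivalent in practice.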
2.2 Practical considerations and remaining challenges

In summary, then, users needing single protein domain structures should most likely turn to AF3, or possibly consult the top human groups or use their servers, as a large number of predictors excelled and they can all be recommended. That said, AF3 probably comes with the simplest interfaces, the fastest run times, the possibility to eventually add other molecular components if needed, and the most detailed outputs, including all the confidence metrics presented in the introduction and exemplified in Figure 1. AF3 is easy and very fast to run, either online or with a local installation (both linked in Table 1); the latter allows for much more flexibility in the inputs (available ligands, providing specific templates, etc.) and in controlling execution (numbers of models, seeds, etc.). In addition, the local installation is not limited in the number of daily submissions. On the downside, AlphaFold 3 is only available for non-commercial use, in which case multimodal systems of the Chai, Boltz and OpenFold families can be considered (although these were not tested, or only to a very limited extent, in CASP16).

Table 1. URLs corresponding to CASP16 baseline methods and CASP16 servers that performed better than them along at least one track

Baselines

Method: AlphaFold 3 website [10]
URL: https://alphafoldserver.com
Notes: Limited number of submissions per day; no custom template or MSA; limited set of ligands.

Method: AlphaFold 3 code [10]
URL: https://github.com/google-deepmind/alphafold3
URL: https://github.com/google-deepmind/alphafold3/blob/main/docs/input.md
Notes: Full software without limitations on ligands, templates or MSAs. The second link is the official description of all inputs. For online inspection of quality scores, check https://go.epfl.ch/af3scores.

Method: AlphaFold 2 via ColabFold [6,7,23,24]
URL: https://github.com/sokrypton/ColabFold
Notes: “AlphaFold2_mmseqs2” notebook recommended, which automatically uses AF2-multimer if needed. Full control on templates and MSAs. The Google Colab notebook needs some effort to use and depends on cores being available; limited number of runs per day.

Methods that ran in automated fashion for CASP16 (“servers” with their official CASP16 names)

Method: MIEnsembles-Server (REF***)
URL: https://seq2fun.dcmb.med.umich.edu/MIEnsembles-Server/
Notes: Automatic stoichiometry prediction. If models are provided, it runs scoring and ranking.

Method: Yang-Server and Yang-Multimer (REF***)
URL: https://yanglab.qd.sdu.edu.cn/trRosetta/
Notes: CASP16 version will be put up late 2025.

Method: MULTICOM_X [28]
URL: https://github.com/BioinfoMachineLearning/MULTICOM4
Notes: MULTICOM4 is the structure prediction engine for monomers and multimers, while the various MULTICOM_X programs rank and select models in different ways.

Despite the overall success, challenges persist. First, while folds are largely correct, high local accuracy (e.g., side-chain conformations, loop regions) is not guaranteed. The monomer assessors noted that the improvement from CASP15 to CASP16 regarding these aspects was very subtle. [19] Second, truncated sequences, irregular secondary structures (e.g., bent helices, over-stabilization of helices/strands in flexible regions), and conformations induced by interchain interactions (for monomers extracted from complexes) remain difficult to model accurately. Similarly, there is virtually no chance that uncommon isomers (for example cis vs. trans at peptide bonds) will be properly modeled, that the effects of single or few mutations will be captured, or that large regions with irregular structures will be folded accurately. The paper assessing monomer modeling shows some examples of these failures in its Figure 3. [19] Of note, AF3’s capability to model post-translational modifications and their effects on protein structure has barely been tested, and is most likely very limited, as some reports have proposed.
[30] A note valid for all the best tools is that while deeper MSAs generally correlate with better accuracy, AF2 and AF3 can for some targets produce good models even with shallower or no MSAs. However, users should be aware that for proteins with very shallow MSAs, such as viral and some eukaryotic proteins, modeling might be more challenging, as found in previous CASP editions and in the latest one. [19] Another conclusion from the monomer assessment paper was that for modeling portions of multi-domain proteins, or proteins that are part of larger complexes, careful definition of the domain boundaries or of the segment to be modeled is crucial. Finally, as already mentioned, model ranking (that is, selecting the single best model from among multiple predictions) remains a general weakness, and while the confidence metrics provide a very good guide, they aren’t infallible. Sampling various alternative models can therefore be useful in certain situations.

3. Modeling of Protein Assemblies and Complexes

Predicting the structure of protein complexes (oligomers) is a very important frontier in fundamental and applied biology, which CASP has been pushing especially in the last 3 editions. CASP16 featured several protein-only complexes as well as complexes of proteins with nucleic acids, and naturally some ligands were involved too; separate, somewhat overlapping, assessments were carried out for the different kinds of complexes, as explained in the introduction and detailed in a dedicated paper of this issue. [22]

3.1 State of the art and top performers

The multimer assessors concluded that complex structure prediction remains an unsolved challenge, with 50% of the targets not very well predicted and over 30% of the total targets proving highly difficult. [17] This difficulty was particularly evident for assemblies with novel interfaces lacking co-evolutionary signals or template information. Nevertheless, moderate overall improvement was seen compared to CASP15.
Most participating groups used AlphaFold-Multimer (AFM, which becomes the default in the ColabFold notebook “AlphaFold2_mmseqs2” when multiple sequences are submitted) or the AF3 server at the core of their modeling engines. Notably, the top-performing groups here did significantly outperform default AFM/AF3 predictions, and they achieved this by using optimized MSAs and refined constructs that allowed modeling large complexes in pieces, employing extensive model sampling, and applying specialized model selection techniques. Unfortunately, though, some of these approaches are expert-based and do not exist as an automated, integrated piece of software or server. Among servers or downloadable software that participated as automated methods (“servers”) in CASP16, the top performers (above AF3) include some of the MULTICOM servers, Yang-Multimer and Yang-Server, and the MIEnsembles-Server, all linked in Table 1. These servers showed a small but noticeable improvement over AF3, as shown in Figure 3 of the paper assessing protein multimer modeling. [17]

3.2 Practical considerations and remaining challenges when modeling complexes

All tools indicated above (and linked in Table 1) are the recommended ones for protein multimer modeling as of CASP16, with the caveat that, as explained, multimer modeling is far from perfect and certainly not as good as modeling of domains. One point is very important to stress, and it is good that a dedicated CASP16 experiment looked into it: knowing the right stoichiometry when predicting the structure of a complex has a big positive impact. CASP16 started with what was called “Phase 0”, in which the sequences of the molecules involved in the complex were provided but not the stoichiometries, and was followed by “Phase 1”, in which predictors were also given the stoichiometries.
This experiment clearly demonstrated that providing the correct stoichiometric information significantly improves the accuracy of predicted assemblies, as detailed by the assessors in Figure 5 of their paper. [17] Users should therefore try to leverage experimental data (e.g., SEC-MALS, native mass spectrometry) to guide modeling whenever possible. Among specific challenges, it was clear in CASP16 that very large or topologically intricate complexes, especially those with novel interfaces or lacking good templates, are still poorly predicted. For very big targets, strategies like “divide and conquer” (modeling subcomplexes or domains separately and then assembling them) were common and worked in some but not all cases. Notably, some kinds of complexes remain particularly difficult to model, such as Antibody-Antigen (AA) complexes, even when they are not too large. The kozakovvajda group, using a pipeline based on their ClusPro docking server [14] (which is publicly available) augmented with other techniques (rather than primarily AFM/AF3 for the direct AA interface prediction), significantly outperformed other groups, including extensively sampled AF3, on AA targets (see Figure 4 of the paper assessing protein multimer modeling [17]). These participants were human only, and these methods aren’t available off-the-shelf online as of mid-2025. Of particular relevance when studying interactions within a complex, achieving high atomic accuracy at protein-protein interfaces (crucial for applications like interface-targeted drug design or the design of mutations that disrupt a signal) remains a significant hurdle. Moreover, sometimes even when the models have a rather good overall score, the interfaces might be somewhat far from perfect, and vice versa. [17] Unfortunately, part of this is due to far-from-perfect model quality estimation and model selection.
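Because a good overall score does not guarantee a good interface, it can help to look specifically at the inter-chain blocks of the residue-residue confidence matrix (PAE) introduced earlier. The sketch below assumes the matrix and the per-residue chain labels have already been loaded into plain Python lists; how a given tool stores them is not shown, and the function name and toy numbers are hypothetical.

```python
def mean_interchain_pae(pae, chain_ids, chain_a, chain_b):
    """Mean PAE over all residue pairs that span two different chains.

    pae is a square matrix (list of lists, in angstroms) and chain_ids gives
    the chain label of each row/column. Both directions (a->b and b->a) are
    averaged because a PAE matrix is generally not symmetric.
    """
    if chain_a == chain_b:
        raise ValueError("choose two different chains")
    values = [
        pae[i][j]
        for i, ci in enumerate(chain_ids)
        for j, cj in enumerate(chain_ids)
        if (ci, cj) in ((chain_a, chain_b), (chain_b, chain_a))
    ]
    if not values:
        raise ValueError("no residue pairs found for the requested chains")
    return sum(values) / len(values)


# Toy 4-residue complex (two residues per chain); made-up values.
toy_chain_ids = ["A", "A", "B", "B"]
toy_pae = [
    [0.0, 2.0, 20.0, 20.0],
    [2.0, 0.0, 20.0, 20.0],
    [10.0, 10.0, 0.0, 2.0],
    [10.0, 10.0, 2.0, 0.0],
]
toy_interface_pae = mean_interchain_pae(toy_pae, toy_chain_ids, "A", "B")
print(toy_interface_pae)
```

In this toy matrix the intra-chain blocks look confident while the inter-chain blocks do not, which is the situation where a model can score well globally yet have an unreliable interface. Lower values are better, and no universal cutoff is implied.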
To assist these modeling tasks, interface confidence scores such as AF3’s pTM and ipTM (interface-specific pTM) are very useful together with the PAE plots, yet far from infallible (see study by Dunbrack [31]). A strategy that might work involves running the prediction protocols with multiple seeds, to generate structural variability that can be evaluated downstream in light of what is already known, expected, or more sensible for the system.

3.3 Modeling hybrid (protein-nucleic acid) complexes

CASP16 featured an increased number of “hybrid” targets containing both proteins and nucleic acids (DNA/RNA). These were assessed by both multimer and nucleic acid assessors. From the perspective of protein assembly assessment, [17] the top-performing groups for the protein-NA interfaces included mainly human predictors, plus some of the automated systems listed in Table 1. The baseline AF3 server ranked behind these specialized groups for hybrid targets. Overall, modeling protein-NA complexes and interfaces was even more challenging than for the protein-protein case, including for the AF3 server. From a very practical point of view, all caveats disclosed, as of the date of submission of this paper only AF3 can natively model proteins and nucleic acids together in a single shot.

4. Modeling of protein-ligand complexes and their affinities

A very interesting highlight of CASP16 was a dedicated track for predicting protein-small molecule (ligand) interactions, focusing on pharmaceutically relevant drug-like compounds and involving both pose prediction (3D structure of the complex) and binding affinity prediction. [18] This track, made possible by collaborations with pharmaceutical companies (Hoffmann-La Roche, Idorsia Pharmaceuticals) and the Structural Genomics Consortium, represented the most pharma-relevant ligand prediction challenge in CASP history.
Despite being the most serious test along these lines in CASP so far, it must be noted that the dataset included only a handful of proteins in 229 protein-ligand target structures. The main part of the assessment, briefly summarized here, focused on the drug-like ligands present in the dataset, all with binding sites and poses well-defined in the 3D structures, and in many cases also with affinities determined as part of industrial drug discovery projects. We touch here on the 3D modeling and affinity predictions for these ligands. Separately, the assessment paper [18] analyzes modeling of “incidental” ligands coming from cofactors, crystallization agents, etc., but we don’t comment on them here.

4.1 Protein-ligand pose prediction

For predicting the binding pose of drug-like ligands, template-based methods performed well, but only when templates were available. In the official ranking, the best groups achieved a mean LDDT-PLI (a measure that blends ligand and pocket accuracy, ranging from 0 for bad poses to 1 for perfect poses) of 0.69 (ClusPro) or slightly lower (the rest); however, they were all human groups. [18] Post-CASP, the assessors ran a set of automated baseline methods that were not available to participants during the challenge. Strikingly, AF3 run locally as a co-folding method (i.e. one in which protein sequence and ligand SMILES are input together for concurrent modeling; more on this later on) achieved a mean LDDT-PLI of 0.80, outperforming all CASP16 participating groups. Boltz-1 and RoseTTAFold-AllAtoms, two other multimodal systems used here for protein-ligand modeling by co-folding, performed well behind AF3, with mean LDDT-PLI values of 0.52 and 0.37, respectively. In one particularly remarkable case, AF3 predicted very accurately the pose of a ligand bound to a protein even though there is no similar binding site or close structural homologue in the PDB.
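To make the co-folding setup concrete, the sketch below assembles an input file for a local AF3 run in which a protein sequence and a ligand SMILES string are submitted together, with the number of protein copies (the stoichiometry) and several seeds set explicitly. The field names follow our reading of the official input description linked in Table 1 and should be checked against it before use; the helper name, the placeholder sequence and the example SMILES are hypothetical.

```python
import json


def build_af3_job(name, protein_sequence, copies, ligand_smiles, seeds):
    """Assemble a co-folding job as a dictionary ready to be saved as JSON.

    Field names ("modelSeeds", "sequences", "dialect", "version") are assumed
    from the AF3 input documentation; verify them for the installed version.
    """
    if not 1 <= copies <= 25:
        raise ValueError("copies must be between 1 and 25 in this sketch")
    # One chain ID per protein copy: A, B, C, ... ("Z" is kept for the ligand).
    protein_ids = [chr(ord("A") + i) for i in range(copies)]
    return {
        "name": name,
        "modelSeeds": list(seeds),  # several seeds give structural variety
        "sequences": [
            {"protein": {"id": protein_ids, "sequence": protein_sequence}},
            {"ligand": {"id": "Z", "smiles": ligand_smiles}},
        ],
        "dialect": "alphafold3",
        "version": 1,
    }


# Placeholder inputs: an arbitrary short sequence and the SMILES of aspirin.
demo_job = build_af3_job(
    name="demo_cofolding",
    protein_sequence="MKTAYIAKQRQISFVKSHFSRQ",
    copies=2,
    ligand_smiles="CC(=O)Oc1ccccc1C(=O)O",
    seeds=[1, 2, 3],
)
print(json.dumps(demo_job, indent=2))
```

Arbitrary SMILES input of this kind is what requires the local installation, since the web server only offers a limited set of ligands.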
The baseline testing conducted in CASP16 also included standard docking with AutoDock Vina, which showed relatively low performance, close to that observed for RoseTTAFold All-Atom. Note, however, that many leading developers of widely used academic and commercial docking suites (AutoDock Vina as tested by CASP, plus regular AutoDock, Glide, etc.) did not participate in this CASP16 challenge, which makes direct comparisons to the full spectrum of established methods difficult and certainly leaves room for experts in these docking-specific tools to perform much better than CASP’s naïve runs.

Looking at the best modeled protein-ligand poses, largely coming from AF3, the assessors found that ligand size (number of atoms), ligand flexibility (number of rotatable bonds), and dissimilarity to small molecules present in the PDB all correlate negatively with prediction accuracy in the CASP16 pharma dataset. 18

The recommendation that emerges from this CASP16 track regarding software to model protein-ligand complexes is that AF3 is the best option, subject to the various caveats discussed: predictions look good but are not perfect and are certainly varied, the dataset used for the benchmark was very limited, docking-specific software might need to be better tested, and a local AF3 installation is required to really be able to model any ligand.

More on co-folding proteins and ligands

The “co-folding” approach made possible by multimodal structure prediction AI systems like AF3, RoseTTAFold All-Atom, Chai-1 and Boltz-1/2, in which the protein and ligand are modeled simultaneously allowing for mutual conformational adaptation, could provide a key advantage over traditional docking software, where the protein’s structure must be known (or modeled) beforehand and is often treated as rigid or at best semi-flexible.
However, despite AF3’s apparently good performance in this area as evaluated in CASP16, two independent broader studies showed that the program relies largely on memorization, with little capability to understand the actual protein-ligand interactions or to make good predictions for systems too far from those used in training. 32,33 Notably, the limitations are stronger for ligands that have only been seen binding in one pocket, whereas more promiscuous ligands such as cofactors show moderately improved performance. This would mean that AF3 is much safer as a tool to model protein-ligand complexes when reasonable templates exist in the PDB; together with the caveats discussed above, it is clear that co-folding must for the moment be used with caution.

4.2 Protein-ligand affinity prediction

Predicting binding affinities proved extremely challenging, with predictions showing very low correlation with the experimentally measured values. 18 Moreover, providing the actual experimental protein-ligand complex structures to predictors in a second stage did not improve affinity prediction accuracy. This strongly suggests that the primary limitation lies in the scoring functions used by current methods rather than in inaccuracies in the predicted poses themselves. Simple ligand descriptors like molecular weight correlated with experimental affinities as well as, or sometimes better than, sophisticated computational methods. Docking scores from programs like AutoDock Vina or GNINA were not good predictors of affinity. It should be acknowledged, however, that standard docking scores are generally neither optimized nor primarily intended for quantitative affinity ranking. Additionally, as stated earlier, the community of researchers working specifically on predicting protein-ligand complexes and their affinities was not well represented among CASP16 predictors, and neither was the latest Boltz-2 model, purportedly capable of modeling complexes and predicting affinities.

5. Modeling of nucleic acids and their complexes

Nucleic acid (NA) structures, particularly RNA structures, are poorly covered in the PDB, and the growing experimental efforts 34–36 would benefit enormously from better computational predictions. CASP16 featured its largest-ever set of NA targets, including DNA and RNA monomers, NA-NA multimers, and NA-protein/ligand complexes. 16 This allowed for the deepest evaluation to date of structure prediction systems at modeling this very important class of molecules and complexes.

5.1 State of the art and top performers

CASP16 showed clearly that NA structure prediction accuracy lags significantly behind that for proteins. No prediction of a previously unseen natural RNA structure achieved a TM-score above 0.8, a threshold considered indicative of well-defined structures. 16 This disparity highlights that strategies successful for proteins have not yet translated effectively to NAs, which still seem to be stuck in a “pre-AlphaFold” era. As for proteins before AF2 came out, accuracy in NA modeling depends heavily on the availability of closely related 3D structural templates in the PDB, and there is little to no hope of accurately modeling targets that lack templates. 16 Unfortunately, then, compared to previous CASP editions 37 or RNA-Puzzles 38 challenges, CASP16 did not show a notable increase in overall NA modeling accuracy. On the positive side, the large number of targets available, covering various types of complexes, allowed for a clear investigation of what works and what does not, as detailed in the assessors’ report 16 and also by the structure providers themselves, who tested how well the best models could (or could not) have replaced the experimental structures. 39

Seven predictor groups performed above the baseline AF3, of which only Yang-Server is a server; all others are human groups.
The differences compared to the baseline are, however, rather small, so the servers are probably the way to go, provided good templates are available for the NA parts of the modeling. Note that a major limitation of AF3 (at least its initial version/server) is that it does not perform template searches for nucleic acids, which hinders its performance on template-amenable targets; the assessors suggested that human predictors, and even the automated Yang-Server, may be better able to use template information. 16

5.2 Specific challenges in nucleic acid modeling

Detailed NA features such as pseudoknots, singlet Watson-Crick base pairs, non-canonical pairs, and specific tertiary motifs such as A-minor interactions (a type of tertiary interaction in which an adenine residue interacts with the minor groove of a nearby RNA helix) are all very hard to model. This is unfortunate because, as the structure providers indicated, these features are often at the center of special functions. 39 And given that coarser details of the 3D topology are so poorly predicted, it is at the moment somewhat pointless to consider these subtle structural features.

On a positive note, the secondary structures (base-pairing patterns) predicted by the 3D modeling groups in CASP16 were often quite accurate, even when the overall 3D topology was rather wrong, in many cases outperforming dedicated secondary structure prediction algorithms that do not build 3D models. This is potentially useful for those who need RNA secondary structure assignments only.

Predicting stoichiometry for RNA multimers (in Phase 0, as commented earlier in the section on protein-containing multimers) was very poor; and even with known stoichiometry, symmetry prediction was challenging. RNA-RNA interface prediction was generally inaccurate, except when very close templates were available. Modeling of NA-protein complexes also remains difficult, as discussed in section 3.3.
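For readers interested only in secondary structure, the agreement between the base pairs extracted from a 3D model and a reference assignment can be quantified with a simple base-pair F1 score. The sketch below is a generic illustration under stated assumptions, not the metric used by the CASP16 assessors: it takes both structures already converted to plain dot-bracket strings (the conversion from 3D coordinates requires an external annotation tool and is not shown), handles only `(`, `)` and `.` (so pseudoknots written with other bracket types are ignored), and uses two made-up strings as input.

```python
def pairs_from_dot_bracket(structure):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced closing bracket")
            pairs.add((stack.pop(), index))
    if stack:
        raise ValueError("unbalanced opening bracket")
    return pairs


def base_pair_f1(predicted, reference):
    """Precision, recall and F1 of predicted base pairs against a reference."""
    if len(predicted) != len(reference):
        raise ValueError("structures must have the same length")
    pred = pairs_from_dot_bracket(predicted)
    ref = pairs_from_dot_bracket(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up hairpin: the model misses the innermost pair of the reference stem.
reference = "((((...))))"
predicted = "(((.....)))"

precision, recall, f1 = base_pair_f1(predicted, reference)
print(f"precision {precision:.2f}, recall {recall:.2f}, F1 {f1:.2f}")
```

A score of this kind can be high even when the global 3D topology of the model is wrong, which is why it is best reported separately from 3D accuracy measures such as the TM-score.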
Good predictions rely largely on templates for both the protein and NA components, and for their interface.

A few CASP16 targets entailed complexes between NAs and small-molecule ligands. For ZTP riboswitches, where good RNA and ligand pose templates existed, some groups made quite accurate predictions, but they were human participants. Meanwhile, for novel NA-ligand targets, poor NA structure prediction precluded accurate modeling of the ligand pocket.

5.3 Practical advice when predicting nucleic acid structures

Reliable de novo prediction of NA structures is currently not feasible, especially for targets lacking templates, and there is little or no hope for complex topologies or non-canonical details. The best chance of obtaining a useful model comes from homology modeling if a suitable 3D template exists, possibly with Yang-Server or AF3 but carefully checking the result against possible templates. Last, for certain applications such as predicting RNA secondary structure only, 3D prediction methods can provide valuable insights even if the 3D model itself is not very good.

6. Concluding remarks: summary of key points and the route forward

CASP16 has once again provided an invaluable snapshot of the capabilities and limitations of biomolecular structure prediction. The profound impact of AlphaFold and related AI methodologies continues to unfold. For most applications, AF3 is probably the way to go, with all caveats discussed; and it is very easy to use, at least in its server version (with the limitations described). For specific applications and for flexibility, the other tools mentioned throughout the text must be considered.

Recapping the limitations, the prediction of complex macromolecular assemblies, while advancing, still presents significant hurdles, particularly for highly intricate systems, in the absence of good templates for modeling, and for systems involving nucleic acids.
The critical role of experimental information, such as stoichiometry for complexes or 3D templates for nucleic acids, was starkly highlighted. Specialized areas like antibody-antigen docking (where some traditional docking methods still hold an edge) and protein-ligand pose prediction (where AF3 shows immense promise as a co-folding method) are evolving rapidly, but still have a good way to go. Finally, we have underscored that while AI tools and web-based implementations have democratized access to state-of-the-art, easy-to-use modeling tools, it is still important to rely on expert interpretation, an understanding of method-specific limitations, and critical evaluation of confidence metrics.

7. References

1. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K. & Moult, J. Critical assessment of methods of protein structure prediction (CASP)—Round XV. Proteins Struct. Funct. Bioinforma. 91, 1539–1549 (2023).
2. Abriata, L. A., Tamò, G. E., Monastyrskyy, B., Kryshtafovych, A. & Dal Peraro, M. Assessment of hard target modeling in CASP12 reveals an emerging role of alignment-based contact prediction methods. Proteins 86 Suppl 1, 97–112 (2018).
3. Kinch, L. N., Li, W., Monastyrskyy, B., Kryshtafovych, A. & Grishin, N. V. Assessment of CASP11 contact-assisted predictions. Proteins 84 Suppl 1, 164–180 (2016).
4. Abriata, L. A., Tamò, G. E. & Dal Peraro, M. A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments. Proteins Struct. Funct. Bioinforma. 87, 1100–1112 (2019).
5. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
6. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
7. Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. 2021.10.04.463034 Preprint at https://doi.org/10.1101/2021.10.04.463034 (2022).
8. Wohlwend, J. et al. Boltz-1 Democratizing Biomolecular Interaction Modeling. BioRxiv Prepr. Serv. Biol. 2024.11.19.624167 (2024) doi:10.1101/2024.11.19.624167.
9. Generalized biomolecular modeling and design with RoseTTAFold All-Atom. Science. https://www.science.org/doi/abs/10.1126/science.adl2528.
10. Abramson, J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 630, 493–500 (2024).
11. Varadi, M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res. 50, D439–D444 (2022).
12. Varadi, M. et al. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Res. 52, D368–D375 (2024).
13. Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29, 1056–1067 (2022).
14. Kovalevskiy, O., Mateos-Garcia, J. & Tunyasuvunakool, K. AlphaFold two years on: Validation and impact. Proc. Natl. Acad. Sci. 121, e2315002121 (2024).
15. Abriata, L. A. The Nobel Prize in Chemistry: past, present, and future of AI in biology. Commun. Biol. 7, 1–3 (2024).
16. Kretsch, R. C. et al. Assessment of nucleic acid structure prediction in CASP16. 2025.05.06.652459 Preprint at https://doi.org/10.1101/2025.05.06.652459 (2025).
17. Zhang, J. et al. Assessment of Protein Complex Predictions in CASP16: Are we making progress? 2025.05.29.656875 Preprint at https://doi.org/10.1101/2025.05.29.656875 (2025).
18. Gilson, M. et al. Assessment of Pharmaceutical Protein-Ligand Pose and Affinity Predictions in CASP16.
19. Yuan, R. et al. CASP16 protein monomer structure prediction assessment. 2025.05.29.656942 Preprint at https://doi.org/10.1101/2025.05.29.656942 (2025).
20. Omidi, A., Møller, M. H., Malhis, N., Bui, J. M. & Gsponer, J. AlphaFold-Multimer accurately captures interactions and dynamics of intrinsically disordered protein regions. Proc. Natl. Acad. Sci. 121, e2406407121 (2024).
21. Piovesan, D., Monzon, A. M. & Tosatto, S. C. E. Intrinsic protein disorder and conditional folding in AlphaFoldDB. Protein Sci. Publ. Protein Soc. 31, e4466 (2022).
22. Kryshtafovych, A. et al. Updates to the CASP infrastructure in 2024.
23. Mirdita, M. et al. ColabFold: making protein folding accessible to all. Nat. Methods 19, 679–682 (2022).
24. Kim, G. et al. Easy and accurate protein structure prediction using ColabFold. Nat. Protoc. 20, 620–642 (2025).
25. Elofsson, A. AlphaFold3 at CASP16. 2025.04.10.648174 Preprint at https://doi.org/10.1101/2025.04.10.648174 (2025).
26. Raouraoua, N. et al. MassiveFold: unveiling AlphaFold’s hidden potential with optimized and parallelized massive sampling. Nat. Comput. Sci. 4, 824–828 (2024).
27. Raouraoua, N., Lensink, M. F. & Brysbaert, G. MassiveFold data for CASP16-CAPRI: a systematic massive sampling experiment. 2025.05.26.653955 Preprint at https://doi.org/10.1101/2025.05.26.653955 (2025).
28. Liu, J., Neupane, P. & Cheng, J. Improving AlphaFold2 and 3-based protein complex structure prediction with MULTICOM4 in CASP16. 2025.03.06.641913 Preprint at https://doi.org/10.1101/2025.03.06.641913 (2025).
29. Liu, J., Neupane, P. & Cheng, J. Accurate Prediction of Protein Complex Stoichiometry by Integrating AlphaFold3 and Template Information. 2025.01.12.632663 Preprint at https://doi.org/10.1101/2025.01.12.632663 (2025).
30. Ramasamy, P., Zuallaert, J., Martens, L. & Vranken, W. F. Assessing the relation between protein phosphorylation, AlphaFold3 models and conformational variability. 2025.04.14.648669 Preprint at https://doi.org/10.1101/2025.04.14.648669 (2025).
31. Dunbrack, R. L. Rēs ipSAE loquunt: What’s wrong with AlphaFold’s ipTM score and how to fix it. bioRxiv 2025.02.10.637595 (2025) doi:10.1101/2025.02.10.637595.
32. Škrinjar, P., Eberhardt, J., Durairaj, J. & Schwede, T. Have protein-ligand co-folding methods moved beyond memorisation? 2025.02.03.636309 Preprint at https://doi.org/10.1101/2025.02.03.636309 (2025).
33. Masters, M. R., Mahmoud, A. H. & Lill, M. A. Do Deep Learning Models for Co-Folding Learn the Physics of Protein-Ligand Interactions? 2024.06.03.597219 Preprint at https://doi.org/10.1101/2024.06.03.597219 (2024).
34. Wang, L. et al. Cryo-EM reveals mechanisms of natural RNA multivalency. Science 388, 545–550 (2025).
35. Kretsch, R. C. et al. Naturally ornate RNA-only complexes revealed by cryo-EM. Nature 1–8 (2025) doi:10.1038/s41586-025-09073-0.
36. Kappel, K. et al. Accelerated cryo-EM-guided determination of three-dimensional RNA-only structures. Nat. Methods 17, 699–707 (2020).
37. Das, R. et al. Assessment of three-dimensional RNA structure prediction in CASP15. bioRxiv 2023.04.25.538330 (2023) doi:10.1101/2023.04.25.538330.
38. Bu, F. et al. RNA-Puzzles Round V: blind predictions of 23 RNA structures. Nat. Methods 22, 399–411 (2025).
39. Kretsch, R. C. et al. Functional relevance of CASP16 nucleic acid predictions as evaluated by structure providers. 2025.04.15.649049 Preprint at https://doi.org/10.1101/2025.04.15.649049 (2025).

Information & Authors

Version history: Version 1, 27 June 2025.

Peer review timeline: published in Proteins: Structure, Function, and Bioinformatics; Version of Record 15 Oct 2025.

Copyright: This work is licensed under a Non Exclusive No Reuse License.

Collection: PROTEINS: Structure, Function, and Bioinformatics

Keywords: alphafold, molecular modeling, structure prediction

Authors and affiliations: Luciano Abriata (ORCID 0000-0003-3087-8677), Ecole polytechnique federale de Lausanne; Matteo Dal Peraro, Ecole polytechnique federale de Lausanne.

Citation: Luciano Abriata, Matteo Dal Peraro. Practical outcomes from CASP16 for users in need of biomolecular structure prediction. Authorea. 27 June 2025. DOI: https://doi.org/10.22541/au.175102407.70404028/v1