Abstract
Predicting antibody and NANOBODY® VHH–antigen complexes remains a critical challenge for state-of-the-art structure prediction models, limiting their impact in therapeutic discovery pipelines. We introduce SNAC-DB, an ML-ready database and curation pipeline enriched with structural biology expertise, designed to accelerate model accuracy and generalization by providing 31–37% expanded structural diversity over existing resources like SAbDab through comprehensive re-curation that extracts maximum value from available experimental structures. SNAC-DB expands coverage by capturing often-overlooked complexes and accurately identifying complete multi-chain epitopes through improved biological-assembly-based logic. Built for ML practitioners, SNAC-DB provides standardized formats with multi-threshold structure-based clustering to enable principled sample weighting during training. Using a rigorous benchmark of public PDB entries deposited post-May 2024 plus confidential therapeutic structures, we evaluate seven leading models (Protenix-v1, OpenFold-3p2, RosettaFold-3, Boltz-2, Boltz-1x, Chai-1, and AlphaFold2.3-multimer) with evaluation methodology tailored to antibody/NANOBODY® VHH–antigen complexes to ensure correct handling of multi-chain epitopes, revealing systematic performance gaps: success rates rarely exceed 25%, confidence-based ranking fails to identify best predictions even when accurate structures exist in ensembles, and all models consistently struggle with therapeutically relevant NANOBODY® VHHs. Systematic evaluation of sampling strategies demonstrates that while generating 1000 samples per target substantially increases the likelihood of producing accurate structures (oracle selection improves from 11.9% to 50.5%), confidence-based ranking remains nearly flat (between 10.9% and 14.9%), revealing that improved ranking mechanisms represent a more tractable path to performance gains.
Finally, fine-tuning GeoDock on SNAC-DB yields higher success rates than training on SAbDab (11.0% vs. 7.1% for antibodies; 7.0% vs. 4.0% for NANOBODY® VHHs), suggesting that SNAC-DB’s expanded structural diversity translates to improved model generalization.
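The gap between oracle selection and confidence-based ranking described above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `success_rates`, the toy scores, and the use of DockQ ≥ 0.23 (the conventional "acceptable" cutoff) as the success criterion are all assumptions made for illustration. Each target has an ensemble of samples, each carrying a model confidence and a DockQ score against the experimental structure.

```python
from typing import Dict, List, Tuple

# One sample = (model confidence, DockQ against the experimental structure).
# Both values here are made-up toy numbers, not results from the paper.
Sample = Tuple[float, float]


def success_rates(
    ensembles: Dict[str, List[Sample]],
    threshold: float = 0.23,  # assumed success cutoff (DockQ "acceptable")
) -> Tuple[float, float]:
    """Return (oracle_rate, top_ranked_rate) over all targets."""
    if not ensembles:
        raise ValueError("need at least one target")
    oracle_hits = 0
    ranked_hits = 0
    for target, samples in ensembles.items():
        if not samples:
            raise ValueError(f"target {target!r} has no samples")
        # Oracle selection: pick the sample with the best DockQ,
        # which is only possible when the true structure is known.
        best_dockq = max(dockq for _, dockq in samples)
        # Confidence-based ranking: pick the sample the model trusts most.
        _, top_dockq = max(samples, key=lambda s: s[0])
        oracle_hits += best_dockq >= threshold
        ranked_hits += top_dockq >= threshold
    n = len(ensembles)
    return oracle_hits / n, ranked_hits / n


toy = {
    "target_a": [(0.90, 0.10), (0.40, 0.55)],  # accurate sample is ranked low
    "target_b": [(0.80, 0.60), (0.30, 0.05)],  # top-ranked sample is accurate
    "target_c": [(0.70, 0.05), (0.60, 0.12)],  # no accurate sample at all
    "target_d": [(0.50, 0.15), (0.45, 0.30)],  # accurate sample is ranked low
}
oracle_rate, ranked_rate = success_rates(toy)
print(f"oracle: {oracle_rate:.2f}, top-ranked: {ranked_rate:.2f}")
```

In this toy data, adding more samples can only raise the oracle rate, while the top-ranked rate improves only if the confidence score happens to favor the accurate samples. That is the pattern the abstract reports at scale.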
Significance Statement
Computational antibody/NANOBODY® VHH design shows promise but remains unreliable for therapeutic development. SNAC-DB provides 31–37% expanded structural diversity through comprehensive data curation, immediately accelerating model development. Benchmarking seven leading AI models reveals accuracy rarely exceeds 25% on therapeutic targets, with confidence-based ranking failing to identify correct structures even when they exist in model outputs. Training on SNAC-DB increases prediction accuracy, validating that high-quality, diverse training data is critical for advancing computational methods toward clinical impact.
Competing Interest Statement
All authors are or were employees of Sanofi at the time this research was conducted and may hold shares and/or stock options in the company. This work was funded by Sanofi.
Footnotes
↵† Work performed during co-op placement at Sanofi.
Abbreviations
- Ab: antibody
- Ag: antigen
- ASU: asymmetric unit
- Cα: alpha carbon
- CDR: complementarity-determining region
- CDR-H3: heavy chain 3rd CDR
- DockQ: docking quality score
- Fab: fragment antigen-binding
- FR: framework region
- Fv: variable fragment
- ipTM: interface predicted template modeling score
- ML: machine learning
- MSA: multiple sequence alignment
- Nb: NANOBODY® VHH
- npy: NumPy array format
- OF3p2: OpenFold-3 Preview2
- PDB: Protein Data Bank
- pLDDT: predicted local distance difference test
- RF3: RosettaFold-3
- SAbDab: Structural Antibody Database
- SNAC-DB: Structural NANOBODY® VHH and Antibody Complex Database
- SOTA: state-of-the-art
- TCR: T cell receptor
- TM: template modeling
- VH: variable heavy chain
- VHH: variable heavy chain of heavy-chain-only antibody
- VL: variable light chain