SNAC-DB: An ML-Ready Database for Antibody and NANOBODY® VHH–Antigen Complexes with Expanded Structural Diversity and Real-World Benchmarking

doi:10.64898/2026.04.22.720253

2026 · doi:10.64898/2026.04.22.720253

preprint OA: closed

Abstract

Predicting antibody and NANOBODY® VHH–antigen complexes remains a critical challenge for state-of-the-art structure prediction models, limiting their impact in therapeutic discovery pipelines. We introduce SNAC-DB, an ML-ready database and curation pipeline enriched with structural biology expertise, designed to accelerate model accuracy and generalization by providing 31–37% expanded structural diversity over existing resources like SAbDab through comprehensive re-curation that extracts maximum value from available experimental structures. SNAC-DB expands coverage by capturing often-overlooked complexes and accurately identifying complete multi-chain epitopes through improved biological-assembly-based logic. Built for ML practitioners, SNAC-DB provides standardized formats with multi-threshold structure-based clustering to enable principled sample weighting during training. Using a rigorous benchmark of public PDB entries deposited post-May 2024 plus confidential therapeutic structures, we evaluate seven leading models (Protenix-v1, OpenFold-3p2, RosettaFold-3, Boltz-2, Boltz-1x, Chai-1, and AlphaFold2.3-multimer) with evaluation methodology tailored to antibody/NANOBODY® VHH–antigen complexes to ensure correct handling of multi-chain epitopes, revealing systematic performance gaps: success rates rarely exceed 25%, confidence-based ranking fails to identify best predictions even when accurate structures exist in ensembles, and all models consistently struggle with therapeutically relevant NANOBODY® VHHs. Systematic evaluation of sampling strategies demonstrates that while generating 1000 samples per target substantially increases the likelihood of producing accurate structures (oracle selection improves from 11.9% to 50.5%), confidence-based ranking remains nearly flat (between 10.9% and 14.9%), revealing that improved ranking mechanisms represent a more tractable path to performance gains. Finally, fine-tuning GeoDock on SNAC-DB yields higher success rates than training on SAbDab (11.0% vs. 7.1% for antibodies; 7.0% vs. 4.0% for NANOBODY® VHHs), suggesting that SNAC-DB’s expanded structural diversity translates to improved model generalization.

Significance Statement

Computational antibody/NANOBODY® VHH design shows promise but remains unreliable for therapeutic development. SNAC-DB provides 31–37% expanded structural diversity through comprehensive data curation, immediately accelerating model development. Benchmarking seven leading AI models reveals accuracy rarely exceeds 25% on therapeutic targets, with confidence-based ranking failing to identify correct structures even when they exist in model outputs. Training on SNAC-DB increases prediction accuracy, validating that high-quality, diverse training data is critical for advancing computational methods toward clinical impact.

Competing Interest Statement

All authors are or were employees of Sanofi at the time this research was conducted and may hold shares and/or stock options in the company. This work was funded by Sanofi.

Footnotes

† Work performed during co-op placement at Sanofi.

Abbreviations

- Ab - antibody
- Ag - antigen
- ASU - asymmetric unit
- Cα - alpha carbon
- CDR - complementarity-determining region
- CDR-H3 - heavy chain 3rd CDR
- DockQ - docking quality score
- Fab - fragment antigen-binding
- FR - framework region
- Fv - variable fragment
- ipTM - interface predicted template modeling score
- ML - machine learning
- MSA - multiple sequence alignment
- Nb - NANOBODY® VHH
- npy - NumPy array format
- OF3p2 - OpenFold-3 Preview2
- PDB - Protein Data Bank
- pLDDT - predicted local distance difference test
- RF3 - RosettaFold-3
- SAbDab - Structural Antibody Database
- SNAC-DB - Structural NANOBODY® VHH and Antibody Complex Database
- SOTA - state-of-the-art
- TCR - T cell receptor
- TM - template modeling
- VH - variable heavy chain
- VHH - variable heavy chain of heavy-chain-only antibody
- VL - variable light chain.
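
The abstract contrasts oracle selection (counting a target as solved if any sample in the ensemble is accurate) with confidence-based ranking (judging only the sample the model scores highest). The sketch below is a minimal illustration of that distinction and is not the paper's evaluation code. It assumes each sample carries a DockQ score and a scalar confidence such as ipTM, and it assumes a success cutoff of DockQ >= 0.23, the conventional "acceptable" threshold, since the abstract does not state the criterion used. The function name, data layout, and toy numbers are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: a prediction is a "success" when DockQ >= 0.23 (the
# conventional "acceptable" cutoff). The abstract does not state the
# threshold actually used in the paper.
SUCCESS_DOCKQ = 0.23

# One sample = (dockq, confidence); confidence could be ipTM, for example.
Sample = Tuple[float, float]


def success_rates(targets: Dict[str, List[Sample]],
                  threshold: float = SUCCESS_DOCKQ) -> Tuple[float, float]:
    """Return (oracle_rate, top_ranked_rate) across all targets."""
    if not targets:
        raise ValueError("no targets given")
    oracle_hits = 0
    ranked_hits = 0
    for samples in targets.values():
        if not samples:
            continue  # a target with no samples counts as a failure
        # Oracle selection: best DockQ anywhere in the ensemble.
        if max(dockq for dockq, _ in samples) >= threshold:
            oracle_hits += 1
        # Confidence ranking: DockQ of the highest-confidence sample only.
        top_dockq, _ = max(samples, key=lambda s: s[1])
        if top_dockq >= threshold:
            ranked_hits += 1
    n = len(targets)
    return oracle_hits / n, ranked_hits / n


# Toy, made-up numbers purely for illustration (not data from the paper).
toy = {
    "target_a": [(0.10, 0.80), (0.55, 0.60)],  # accurate sample exists, ranked low
    "target_b": [(0.40, 0.90), (0.05, 0.30)],  # accurate sample ranked first
    "target_c": [(0.02, 0.70), (0.08, 0.65)],  # no accurate sample at all
    "target_d": [(0.12, 0.85), (0.30, 0.40)],  # accurate sample exists, ranked low
}
oracle_rate, ranked_rate = success_rates(toy)
print(f"oracle: {oracle_rate:.2f}, top-ranked: {ranked_rate:.2f}")
```

In this toy set three of four targets contain an accurate sample, but confidence picks it for only one, which is the kind of gap the abstract reports between oracle selection and confidence-based ranking as the number of samples grows.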

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00