Thrifty wide-context models of B cell receptor somatic hypermutation

preprint OA: closed
Abstract

Somatic hypermutation (SHM) is the diversity-generating process in antibody affinity maturation. Probabilistic models of SHM are needed for analyzing rare mutations, for understanding the selective forces guiding affinity maturation, and for understanding the underlying biochemical process. High-throughput data offers the potential to develop and fit models of SHM on relevant data sets. In this paper we model SHM using modern frameworks. We are motivated by recent work suggesting the importance of a wider context for SHM; however, assigning an independent rate to each k-mer leads to an exponential proliferation of parameters. Thus, using convolutions on 3-mer embeddings, we develop “thrifty” models of SHM of various sizes; these can have fewer free parameters than a 5-mer model and yet have a significantly wider context. These offer a slight performance improvement over a 5-mer model, and other modern model elaborations worsen performance. We also find that a per-site effect is not necessary to explain SHM patterns given nucleotide context. Also, the two current methods for fitting an SHM model (on out-of-frame sequence data and on synonymous mutations) produce significantly different results, and augmenting out-of-frame data with synonymous mutations does not aid out-of-sample performance.

Competing Interest Statement

The authors have declared no competing interest.

Footnotes

Revising to address reviewer comments from eLife. There isn't room here to describe them all, but here are the major ones:

> (1) 10x/single cell data has a fairly different error profile compared
> to bulk data. A synonymous model should be built from the same
> `briney` dataset as the base model to validate the difference between
> the two types of training data.

We have repeated the same analysis with synonymous mutations derived from the bulk-sequenced `tang` dataset for Figure 4 and the supplementary figure. The conclusion remains the same. We used `tang` because only the out-of-frame sequences were available to us for the `briney` data set, as we were using preprocessing from the Spisak paper.

> (6) Have you looked to see if Soto et al Nature 2019
> (https://doi.org/10.1038/s41586-019-0934-8) provides usable data for
> your purposes?

Thank you for making us aware of this data set! However, we are afraid that it did not provide the large volume of out-of-frame data that we were hoping for, as we now describe:

""" From Soto et al. (2019), we obtained pre-processed data for all 3 HIP donors from the authors. We ran our pipeline on a large subset of the data (sampling the first 1 million sequences for each donor IgH fasta file) to assess its potential for our purposes. From the 3 million sequences processed, we extracted 2,686 out-of-frame sequences in total. These sequences corresponded to 11 clonal families of size 2+ and 2,618 singletons. We obtained 102 parent-child pairs from non-singletons, of which only 57 contained a mutation event. The relatively low recovery of out-of-frame sequences in this subset of the data suggested that processing the full dataset would not yield a meaningful amount of parent-child pairs for this study. We additionally observed that all of these sequences had no coverage at the start of the V gene, missing the first 12-60 bases.
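
Returning to the abstract's parameter-count argument: the sketch below compares the number of free parameters in a model with one rate per k-mer against a small convolutional model built on 3-mer embeddings. The layer layout (embedding, one 1D convolution, linear readout) and every size used (`embed_dim=4`, `kernel_size=9`, `num_filters=8`) are assumptions made for illustration; they are not taken from the paper, and the function names are hypothetical.

```python
# Illustrative parameter counting only. The architecture and all sizes below
# are assumptions for this sketch, not the paper's actual models.
NUM_BASES = 4


def kmer_model_params(k: int) -> int:
    """One independent rate per k-mer context: 4**k parameters."""
    return NUM_BASES ** k


def conv_model_params(embed_dim: int, kernel_size: int, num_filters: int) -> int:
    """Parameters of an assumed 3-mer embedding -> 1D convolution -> linear readout."""
    embedding = NUM_BASES ** 3 * embed_dim                      # one vector per 3-mer
    conv = embed_dim * kernel_size * num_filters + num_filters  # weights + biases
    readout = num_filters + 1                                   # one weight per filter + bias
    return embedding + conv + readout


def context_width(kernel_size: int) -> int:
    """Bases seen per site: kernel_size overlapping 3-mers span kernel_size + 2 bases."""
    return kernel_size + 2


five_mer = kmer_model_params(5)
thrifty = conv_model_params(embed_dim=4, kernel_size=9, num_filters=8)
width = context_width(9)
same_width_kmer = kmer_model_params(width)

print(f"5-mer model: {five_mer} parameters, context of 5 bases")
print(f"conv sketch: {thrifty} parameters, context of {width} bases")
print(f"{width}-mer model: {same_width_kmer} parameters")
```

Under these assumed sizes the convolutional sketch has 561 parameters and sees 11 bases per site, against 1,024 parameters for a 5-mer model and 4,194,304 for an independent-rate 11-mer model, which is the exponential growth the abstract refers to.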


Citation neighborhood (no data yet)

We don't have any in-corpus citations linked to this paper yet. This is a recent paper (2024) — citers typically take a year or two to land, and the OpenAlex reference graph may still be filling in.

Source provenance

europepmc
last seen: 2026-05-20T01:45:00.602351+00:00