Thrifty wide-context models of B cell receptor somatic hypermutation

Kevin Sung
Mackenzie M Johnson
Will Dumm
Noah Simon
Hugh Haddox
Julia Fukuyama
Frederick A Matsen

Curated by eLife

eLife Assessment

This study provides an important method to model the statistical biases of hypermutations during the affinity maturation of antibodies. The authors show convincingly that their model outperforms previous methods with fewer parameters; this is made possible by the use of machine learning to expand the context dependence of the mutation bias. They also show that models learned from synonymous mutations and from out-of-frame sequences are different, prompting new questions about germinal center function. Strengths of the study include an open-access tool for using the model, a careful curation of existing datasets, and a rigorous benchmark; it is also shown that current machine-learning methods are limited by the availability of data, which explains the only modest gain in model performance afforded by modern machine learning.

Abstract

Somatic hypermutation (SHM) is the diversity-generating process in antibody affinity maturation. Probabilistic models of SHM are needed for analyzing rare mutations, for understanding the selective forces guiding affinity maturation, and for understanding the underlying biochemical process. High-throughput data offer the potential to develop and fit models of SHM on relevant data sets. In this paper we model SHM using modern frameworks. We are motivated by recent work suggesting the importance of a wider context for SHM; however, assigning an independent rate to each k-mer leads to an exponential proliferation of parameters. Thus, using convolutions on 3-mer embeddings, we develop “thrifty” models of SHM that have fewer free parameters than a 5-mer model and yet have a significantly wider context. These offer a slight performance improvement over a 5-mer model. We also find that a per-site effect is not necessary to explain SHM patterns given nucleotide context. Also, the two current methods for fitting an SHM model (on out-of-frame sequence data and on synonymous mutations) produce significantly different results, and augmenting out-of-frame data with synonymous mutations does not aid out-of-sample performance.
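To make the "exponential proliferation of parameters" concrete, the short sketch below counts the distinct contexts that a model assigning one independent rate to each k-mer would need. It is only arithmetic on the four-letter DNA alphabet. The function name `kmer_context_count` is hypothetical, and published k-mer models may tie or add parameters (for example, for the identity of the substituted base), so these are counts of contexts rather than exact parameter totals for any particular model.

```python
def kmer_context_count(k: int) -> int:
    """Number of distinct DNA k-mers over the alphabet {A, C, G, T}."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return 4 ** k


if __name__ == "__main__":
    # Each step of two extra bases multiplies the number of contexts by 16.
    for k in (3, 5, 7, 9, 11, 13):
        print(f"{k:>2}-mer contexts: {kmer_context_count(k):,}")
```

Going from a 5-mer to a 13-mer context multiplies the number of contexts by 4^8 = 65,536, which illustrates why a wider context calls for some form of parameter sharing.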

eLife
Mar 18, 2025

Reviewer #1 (Public review):

Summary:

This paper introduces a new class of machine learning models for capturing how likely a specific nucleotide in a rearranged IG gene is to undergo somatic hypermutation. These models modestly outperform existing state-of-the-art efforts, despite having fewer free parameters. A surprising finding is that models trained on all mutations from non-functional rearrangements give divergent results from those trained on only silent mutations from functional rearrangements.
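For readers less familiar with the distinction between the two kinds of training data, the sketch below shows one way to decide whether a single-base substitution in a productive (in-frame) sequence is silent, that is, synonymous. It uses the standard genetic code. The function name `is_synonymous` and the assumption that the sequence starts in frame at index 0 are illustrative choices, not a description of the authors' data-processing pipeline.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons enumerated in T, C, A, G order per position.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(seq: str, site: int, new_base: str) -> bool:
    """Return True if putting new_base at 0-based `site` keeps the amino acid.

    Assumption: `seq` is in frame starting at index 0 and the affected
    codon is complete and contains only A, C, G, T.
    """
    start = site - site % 3
    codon = seq[start:start + 3]
    if len(codon) < 3:
        raise ValueError("site falls in an incomplete trailing codon")
    offset = site % 3
    mutated = codon[:offset] + new_base + codon[offset + 1:]
    return CODON_TABLE[codon] == CODON_TABLE[mutated]
```

For example, in `"ATGCTG"` the change G to A at site 5 turns CTG into CTA (both leucine) and is synonymous, whereas G to A at site 2 turns ATG (methionine) into ATA (isoleucine) and is not.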

Strengths:

(1) The new model structure is quite clever and will provide a powerful way to explore larger models.

(2) Careful attention is paid to curating and processing large existing data sets.

(3) The authors are to be commended for their efforts to communicate with the developers of previous models and use the strongest possible versions of those in their current evaluation.

Weaknesses:

(1) 10x/single-cell data have a fairly different error profile from bulk data. A synonymous model should be built from the same `briney` dataset as the base model to validate the difference between the two types of training data.

(3) The rationale for testing only kernel sizes of 7, 9, and 11 is not described. The selection/optimization of embedding size is not explained. The filters listed in Table 1 are not defined.
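As a rough guide to what the kernel sizes mentioned above imply, the sketch below computes the nucleotide context seen by one convolution output when the inputs are overlapping 3-mer tokens taken at every site, together with the parameter count of a hypothetical embedding, single convolution, and linear output stack. The layer layout, the function names, and the example values `embedding_dim=4` and `num_filters=8` are assumptions made for illustration; they are not taken from the paper or its Table 1.

```python
def context_width(kernel_size: int, token_width: int = 3) -> int:
    """Bases covered by one convolution output over overlapping, stride-1 tokens."""
    return kernel_size + token_width - 1


def sketch_param_count(kernel_size: int, embedding_dim: int,
                       num_filters: int, token_width: int = 3) -> int:
    """Free parameters of a hypothetical embedding -> conv -> linear stack."""
    embedding = 4 ** token_width * embedding_dim                  # one vector per 3-mer
    conv = kernel_size * embedding_dim * num_filters + num_filters  # weights + biases
    head = num_filters + 1                                        # linear map to one rate
    return embedding + conv + head


if __name__ == "__main__":
    for kernel_size in (7, 9, 11):
        width = context_width(kernel_size)
        params = sketch_param_count(kernel_size, embedding_dim=4, num_filters=8)
        print(f"kernel {kernel_size:>2}: {width}-base context, {params} parameters")
```

Under these assumptions the parameter count grows linearly with kernel size, while the number of distinct contexts of the same width grows as a power of four.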

Reviewer #2 (Public review):

This work offers an insightful contribution for researchers in computational biology, immunology, and machine learning. By employing a 3-mer embedding and CNN architecture, the authors demonstrate that it is possible to extend sequence context without exponentially increasing the model's complexity.
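The idea of combining a 3-mer embedding with a convolution can be sketched in a few lines of dependency-free Python. The code below is a toy forward pass with random, untrained weights: each site is represented by the embedding of the 3-mer centred on it, a 1D convolution mixes neighbouring embeddings, and a linear layer followed by an exponential gives a positive per-site rate. The class name `TinyThriftySketch`, the ReLU activation, the exponential output, the zero padding, and all sizes are assumptions for illustration. This is not the authors' implementation; their released Python package is the reference.

```python
import math
import random
from itertools import product

BASES = "ACGT"
KMER_INDEX = {"".join(p): i for i, p in enumerate(product(BASES, repeat=3))}


def tokenize(seq):
    """Index of the 3-mer centred on each site (None where it touches padding)."""
    padded = "N" + seq + "N"
    return [KMER_INDEX.get(padded[i:i + 3]) for i in range(len(seq))]


class TinyThriftySketch:
    """Toy embedding -> 1D convolution -> linear -> exp model (untrained)."""

    def __init__(self, embedding_dim=4, num_filters=3, kernel_size=5, seed=0):
        if kernel_size % 2 == 0:
            raise ValueError("kernel_size must be odd so it can be centred")
        rng = random.Random(seed)
        rand = lambda: rng.uniform(-0.1, 0.1)
        self.embedding_dim = embedding_dim
        self.kernel_size = kernel_size
        self.embedding = [[rand() for _ in range(embedding_dim)] for _ in range(64)]
        # conv[f][offset][d]: weight of filter f at kernel offset and embedding dim d
        self.conv = [[[rand() for _ in range(embedding_dim)]
                      for _ in range(kernel_size)] for _ in range(num_filters)]
        self.conv_bias = [rand() for _ in range(num_filters)]
        self.head = [rand() for _ in range(num_filters)]
        self.head_bias = rand()

    def rates(self, seq):
        """Return one positive rate per site of seq."""
        zero = [0.0] * self.embedding_dim
        embedded = [zero if t is None else self.embedding[t] for t in tokenize(seq)]
        half = self.kernel_size // 2
        out = []
        for site in range(len(seq)):
            log_rate = self.head_bias
            for f, kernel in enumerate(self.conv):
                act = self.conv_bias[f]
                for offset, weights in enumerate(kernel):
                    j = site + offset - half
                    if 0 <= j < len(seq):  # positions outside the sequence act as zeros
                        act += sum(w * x for w, x in zip(weights, embedded[j]))
                log_rate += self.head[f] * max(act, 0.0)  # ReLU, then linear head
            out.append(math.exp(log_rate))
        return out
```

Calling `TinyThriftySketch().rates("ACGTACGTAC")` returns ten positive numbers, one per site. Because the weights are shared across positions, widening the kernel adds parameters linearly rather than exponentially.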

Key findings include:

(1) Efficiency and Performance: Thrifty CNNs outperform traditional 5-mer models and match the performance of significantly larger models like DeepSHM.

(2) Neutral Mutation Data: A distinction is made between using synonymous mutations and out-of-frame sequences for model training, with evidence suggesting these methods capture different aspects of SHM, or different biases in the type of data.

(3) Open Source Contributions: The release of a Python package and pre-trained models adds practical value for the community.

However, readers should be aware of the limitations. The improvements over existing models are modest, and the work is constrained by the availability of high-quality out-of-frame sequence data. The study also highlights that more complex modeling techniques, like transformers, did not enhance predictive performance, which underscores the role of data availability in such studies.

Reviewer #3 (Public review):

Summary:

Modeling and estimating sequence context biases during B cell somatic hypermutation is important for accurately modeling B cell evolution to better understand responses to infection and vaccination. Sung et al. introduce new statistical models that capture a wider sequence context of somatic hypermutation with a comparatively small number of additional parameters. They demonstrate their model's performance with rigorous testing across multiple subjects and datasets. Prior work has captured the mutation biases of fixed 3-, 5-, and 7-mers, but each of these expansions has significantly more parameters. The authors developed a machine-learning-based approach to learn these biases using wider contexts with comparatively few parameters.

Strengths:

Well-motivated and defined problem. Clever solution to expand nucleotide context. Complete separation of training and test data by using different subjects for training vs testing. Release of open-source tools and scripts for reproducibility.
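The subject-level separation praised here can be illustrated with a minimal sketch: every sequence from a given subject goes entirely to the training set or entirely to the test set, so no individual contributes to both. The function name `split_by_subject` and the `(subject_id, sequence)` record layout are hypothetical and are not taken from the authors' scripts.

```python
def split_by_subject(records, test_subjects):
    """Split (subject_id, sequence) pairs so no subject appears on both sides."""
    test_subjects = set(test_subjects)
    train, test = [], []
    for subject_id, sequence in records:
        target = test if subject_id in test_subjects else train
        target.append((subject_id, sequence))
    return train, test


if __name__ == "__main__":
    # Hypothetical toy records; real data would hold full receptor sequences.
    records = [("s1", "ACGT"), ("s2", "GGCA"), ("s1", "TTAC"), ("s3", "CAGT")]
    train, test = split_by_subject(records, test_subjects={"s3"})
    print(len(train), "training sequences;", len(test), "test sequences")
```

Splitting by subject rather than by sequence avoids placing closely related sequences from the same individual on both sides of the split.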

Weaknesses:

This study could be improved with better descriptions of the datasets' sequencing technology, sequencing depth, etc., but this is a minor weakness.

Version published to 10.7554/elife.105471.1 on eLife
Mar 18, 2025
Version published to 10.7554/elife.105471 on eLife
Mar 18, 2025
Version published to 10.1101/2024.11.26.625407v1 on bioRxiv
Dec 1, 2024
