Large-scale exploration of protein space by automated NMR

Abstract

Protein structures can now be predicted and designed at scale, yet experimental access to dynamics and conformational heterogeneity remains limited in throughput. This gap prevents a systematic understanding of how protein sequences encode motion and functional flexibility. Here, we establish a scalable experimental pipeline combining protein design, automated production, and nuclear magnetic resonance (NMR) spectroscopy to enable high-throughput characterization of protein structure and dynamics at atomic resolution. A single operator can produce and analyze hundreds of isotopically labeled proteins per week, with per-sample cost largely defined by DNA synthesis. To benchmark this approach, we experimentally characterized 384 de novo designed proteins spanning diverse regions of structure space. High-quality two-dimensional NMR spectra were obtained for 239 samples (62% of designs overall). NMR characterization confirmed that the designed proteins adopt their intended folds, and revealed unexpected local dynamics that are not captured by current computational models. Our approach establishes a foundation for data-driven modelling of sequence–structure–dynamics relationships and unlocks a new regime of statistical structural biology, where insight into protein biophysics is gained from experimental ensemble studies of suitably designed protein clusters.

This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/18821335.

Overview

The traditional paradigm of "one structure, one function" is being replaced by the more accurate view that proteins exist as an ensemble of folded structures. While many proteins are presumed to adopt a single, predominant structure, thereby minimizing free energy landscape exploration upon folding, the prevalence of this phenomenon remains underexplored due to technical challenges in capturing these ensembles. However, many recent works demonstrate how alternate conformations within the ensemble exhibit distinct functions (dubbed "functional sub-states"), which can be selected for during evolution or rationally optimized by protein engineers. To date, advances in ensemble exploration have primarily been constrained by the limited scale of "ensemble detecting" techniques.

Here, the authors present NMR-Automated Protein Production (NMR-APP), which aims to tackle this issue. This is a highly integrated, robot-assisted pipeline that automates every step from construct assembly through protein production and purification to the collection of NMR spectra. The authors tested 384 protein designs in their pipeline, successfully purified 98.7% of the constructs, and obtained NMR spectra for 62% of them, all at a low cost of $25 per sample. They computed the coefficient of variation (CV) of spectral peak intensities as a proxy for dynamics and demonstrated that these "statically" designed protein structures display pervasive dynamics. Further analysis of nine constructs to assess local conformational flexibility in the backbone indicated the presence of heterogeneous and computationally unpredictable dynamical features. This work represents a significant advance toward understanding sub-state sampling within ensembles, suggesting that such sampling may be an inherent characteristic of designed proteins, even those with a single intended structural state.
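For readers unfamiliar with the metric, the following is a minimal sketch of a coefficient of variation computed over the peak intensities of one spectrum. It is an illustration only and not the authors' implementation: the function name `peak_intensity_cv`, the use of the population standard deviation, and the example intensity values are all assumptions made here, and the preprint should be consulted for the exact definition.

```python
from statistics import mean, pstdev


def peak_intensity_cv(intensities):
    """Return the coefficient of variation (std / mean) of peak intensities.

    Hypothetical helper: assumes one intensity per resolved peak and uses
    the population standard deviation, which may differ from the preprint.
    """
    if len(intensities) < 2:
        raise ValueError("need at least two peak intensities")
    mu = mean(intensities)
    if mu <= 0:
        raise ValueError("mean peak intensity must be positive")
    return pstdev(intensities) / mu


# Made-up intensities, for illustration only.
uniform_peaks = [1.0, 1.0, 1.0, 1.0]  # equal intensities give a CV of zero
uneven_peaks = [1.0, 0.2, 1.8, 1.0]   # spread-out intensities give a larger CV

print(peak_intensity_cv(uniform_peaks))
print(peak_intensity_cv(uneven_peaks))
```

Under this reading, a rigid protein with uniform peak intensities would give a CV near zero, while residue-specific line broadening from exchange or local motion would spread the intensities and raise the CV.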

Areas for improvement / questions for the authors

1. A commonly cited limitation of NMR is protein size. That said, no metrics are presented in this work on the length of the designs. What are the mean and median lengths? What is the distribution? Is length correlated with any other metrics that may explain the success rate in the NMR-APP pipeline, such as the concentration of purified protein or the expected extent of dynamical regions?

2. The authors mention the ability of Proteina to generate designs that are less biased toward alpha helices. This is a great inclusion (along with the use of FoldSeek) for enabling broad sampling of structural features, and it also seemed to improve the experimental success rate. However, mean CV versus the model used was not included in the cross-correlations in Figure S4, and this is a potentially interesting relationship. In other words, does one model have a greater propensity to produce dynamical structures than the other? Additionally, based on the findings, could the authors speculate on what parameters within, or between, the models could be tweaked to increase the probability of encountering dynamical features in the designs?

3. In lines 245-247, the authors mention that these NMR datasets may help train models to predict dynamics from first principles. Although the authors do not speculate on a putative timeline for such advances, it seems like an ambitious claim. To my knowledge, and as the authors mention, most structural prediction models rely heavily on sequence data. Even with the increasing availability of structural data from pipelines like NMR-APP, is this a sufficient amount of data for training of AI/ML models?

4. The observation that, for p2 A4, minor states were sampled at a proportion of <5% demonstrates the sensitivity of NMR-APP. Can the authors speculate, based on the CV of other proteins relative to p2 A4, whether other proteins are likely to show a similarly overwhelming preference for a single state? Furthermore, can the authors comment on how these rare states could be rationally exploited in protein engineering tasks such as allosteric control, fold-switching, or directed evolution of more active sub-states?

Summary

Overall, this work represents a technical tour de force in high-throughput structural biology. The current findings leave open questions regarding the pervasiveness of dynamics across proteins with vastly different sizes, as well as the volume of data required to shift AI/ML training from sequence-based to structure- or "first principles" based. Addressing the distribution of design lengths and the correlation between specific generative models and dynamic outcomes would further solidify this paper's impact on the field. Nevertheless, NMR-APP demonstrates that the theoretical concept of ensembles and sub-state sampling is becoming more practical to probe experimentally at scale – a very exciting avenue for expanding potential protein designs and better understanding protein evolution from a dynamical lens.

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they did not use generative AI to come up with new ideas for their review.
