Protein Electrostatic Properties are Fine-Tuned Through Evolution
Abstract
Protein ionization states provide electrostatic forces that modulate protein structure, stability, solubility, and function. Until now, predicting ionization states and understanding protein electrostatics have relied on structural information. Here we demonstrate that primary sequence alone enables remarkably accurate pKa predictions through KaML-ESM, a model that leverages evolutionary representations from ultra-large protein language models (ESMs) and pretraining on a synthetic pKa dataset. The KaML-ESM model achieves RMSEs approaching the experimental precision limit of ∼0.5 pH units for Asp, Glu, His, and Lys residues, while reducing Cys prediction errors to 1.1 units, with further improvement expected as the training dataset expands. The state-of-the-art performance of KaML-ESM was further validated through external evaluations, including a proteome-wide analysis of protein pKa values. Our results support the notion that protein sequence encodes not only structure and function but also electrostatic properties, which may have been co-optimized through evolution. Lastly, we provide KaML, a sequence-based end-to-end ML platform that enables researchers to map protein electrostatic landscapes, facilitating applications ranging from drug design and protein engineering to molecular simulations.
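To make the link between a pKa value and an ionization state concrete, the following minimal sketch applies the standard Henderson-Hasselbalch relation to a single titratable site. It is not part of KaML-ESM: the function name `deprotonated_fraction` and the pKa of 4.0 are assumptions chosen purely for illustration, and the site is treated as independent of all others.

```python
def deprotonated_fraction(pka: float, ph: float) -> float:
    """Fraction of an isolated titratable site that is deprotonated at a given pH.

    Uses the Henderson-Hasselbalch relation: f = 1 / (1 + 10**(pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Hypothetical acidic side chain with an assumed pKa of 4.0 (illustrative only,
# not a value taken from the paper).
pka_assumed = 4.0
for ph in (3.0, 4.0, 4.5, 7.0):
    frac = deprotonated_fraction(pka_assumed, ph)
    print(f"pH {ph:.1f}: {frac:.3f} deprotonated")
```

Under this simplified single-site model, a pKa error of 0.5 units near the titration midpoint moves the estimated deprotonated fraction from 0.50 to roughly 0.76, which suggests why prediction errors on the order of 0.5 pH units matter most for residues whose pKa lies close to the pH of interest.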