KaMLs for Predicting Protein pKa Values and Ionization States: Are Trees All You Need?
This article has been reviewed by the following groups
Listed in
- Reviewed articles (Biophysics Colab)
Abstract
Despite its importance in understanding biology and in computer-aided drug discovery, the accurate prediction of protein ionization states remains a formidable challenge. Physics-based approaches struggle to capture the small, competing contributions in the complex protein environment, while machine learning (ML) is hampered by the scarcity of experimental data. Here we report the development of pKa ML (KaML) models based on decision trees and graph attention networks (GAT), exploiting physicochemical understanding and a new experimental pKa database (PKAD-3) enriched with highly shifted pKa values. KaML-CBtree significantly outperforms the current state of the art in predicting pKa values and ionization states across all six titratable amino acids, notably achieving accurate predictions for deprotonated cysteines and lysines, a blind spot in previous models. The superior performance of KaMLs is achieved in part through several innovations, including separate treatment of acid and base, data augmentation using AlphaFold structures, and model pretraining on a theoretical pKa database. We also introduce the classification of protonation states as a metric for evaluating pKa prediction models. A meta-feature analysis suggests a possible reason why the lightweight tree model outperforms the more complex deep learning GAT. We release an end-to-end pKa predictor based on KaML-CBtree and the new PKAD-3 database, which facilitates a variety of applications and provides the foundation for further advances in protein electrostatics research.
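To illustrate how protonation-state classification can serve as an evaluation metric, the following is a minimal sketch and not the authors' implementation. It assumes a reference pH of 7.0, labels a site as protonated when its pKa exceeds that pH (following the Henderson-Hasselbalch relation), and treats "protonated" as the positive class. The function names and the example pKa values are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def protonated_fraction(pka: float, ph: float = 7.0) -> float:
    """Henderson-Hasselbalch: fraction of the site that is protonated at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def is_protonated(pka: float, ph: float = 7.0) -> bool:
    """A site is predominantly protonated when its pKa lies above the pH (assumed rule)."""
    return pka > ph


def classification_scores(
    pka_true: Sequence[float], pka_pred: Sequence[float], ph: float = 7.0
) -> Dict[str, float]:
    """Compare protonation states derived from experimental and predicted pKa values."""
    tp = fp = fn = tn = 0
    for true_value, predicted_value in zip(pka_true, pka_pred):
        actual = is_protonated(true_value, ph)
        predicted = is_protonated(predicted_value, ph)
        if actual and predicted:
            tp += 1
        elif predicted:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Hypothetical lysine pKa values (not taken from PKAD-3).
experimental = [10.4, 6.5, 10.6, 5.7]
predicted = [10.1, 7.4, 10.2, 6.0]
scores = classification_scores(experimental, predicted)
```

In this made-up example the second site has a small pKa error (6.5 versus 7.4) yet is assigned the wrong protonation state, which is the kind of failure a pure pKa error metric can hide and a state-based metric exposes.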