ICEPIC: A Toolkit to Discover Ice Binding Proteins from Sequence
Abstract
Ice binding proteins, such as antifreeze proteins (AFPs) and ice nucleation proteins (INPs), are critical for survival in subzero environments and have wide-ranging applications in biotechnology, agriculture, and materials science. Current discovery methods for these proteins are constrained by low throughput and by limited datasets that are not conducive to engineering. Here, we present a high-throughput, sequence-based model that leverages contextual embeddings from protein language models to predict ice binding potential, as well as the expression and activity potential of candidate proteins. Using a curated data corpus of over 18,000 ice binding proteins (far larger than previous datasets), we fine-tuned a ProtBERT-based model, achieving 99% accuracy for prediction of ice binding potential. Sensitivity analyses through targeted mutagenesis (alanine and threonine substitutions) confirmed the model's biological significance, revealing functionally important residues and sequence patterns. Additionally, we developed an expression prediction model that achieved an R² score of 0.64 and low false-negative rates in identifying highly expressible candidates in Pichia pastoris. An additional regression model trained to predict ice activity, as measured by thermal hysteresis, achieved an R² score of at least 0.79, with a clear difference in prediction between ice binding and non-ice binding proteins. Our toolkit advances the predictive accuracy, interpretability, and scalability of ice binding protein discovery, offering a powerful tool for protein engineering in cold-environment applications.
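As an illustration of the targeted mutagenesis sensitivity analysis mentioned above, the following minimal Python sketch substitutes each residue in turn with alanine (or threonine) and records how a scoring function responds. It is not the authors' implementation: the function names, the example sequence, and especially `toy_score`, a placeholder that merely counts threonines, are assumptions made for illustration. In practice the scoring callable would wrap the fine-tuned classifier.

```python
from typing import Callable, List, Tuple


def substitution_scan(
    seq: str, score: Callable[[str], float], sub: str = "A"
) -> List[Tuple[int, str, float]]:
    """Return (1-based position, original residue, score change) per single substitution."""
    base = score(seq)
    results = []
    for i, res in enumerate(seq):
        if res == sub:
            continue  # replacing a residue with itself is uninformative
        mutant = seq[:i] + sub + seq[i + 1:]
        results.append((i + 1, res, score(mutant) - base))
    return results


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: fraction of threonine residues."""
    return seq.count("T") / len(seq)


seq = "TCTNSQHCT"  # made-up sequence, for illustration only
ala_scan = substitution_scan(seq, toy_score, sub="A")
thr_scan = substitution_scan(seq, toy_score, sub="T")

# The position whose alanine substitution lowers the score the most
most_sensitive = min(ala_scan, key=lambda r: r[2])
```

Positions whose substitution produces the largest drop in score are candidates for functionally important residues under the model being probed.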
Significance Statement
Ice binding proteins enable organisms to survive freezing temperatures and are essential for applications in cryopreservation, agriculture, and materials science. However, discovering and engineering these proteins has been limited by small datasets and inadequate predictive tools. We developed a machine learning model trained on over 18,000 ice binding protein sequences to predict not only ice binding potential but also protein activity and expression in engineered hosts. This approach integrates advanced protein language models with biological context, enabling faster, more reliable discovery of ice binding proteins. Our platform advances rational protein design for real-world applications in climate resilience, biotechnology, and beyond.