Enhancing protein structure prediction accuracy by prioritizing important residues using protein language models
Abstract
Accurate prediction of protein tertiary structures from amino acid sequences remains a fundamental challenge in computational biology. Although AlphaFold2 represents a major advance, systematic discrepancies persist between its predictions and experimentally determined structures. Given that individual residues contribute differentially to protein function, we hypothesized that incorporating residue-specific importance metrics could improve prediction accuracy. Here, we develop i-Fold (importance Fold), a neural architecture that enhances AlphaFold2 by integrating residue importance scores (RIS), derived from the protein language model ESM, as dynamic positional weights during training. Weighting amino acids by RIS directs computational attention toward functionally critical residues and regions. Evaluation on a benchmark test set of 3,559 protein structures reveals that i-Fold significantly improves accuracy (reduction in r.m.s.d., p = 0) and achieves a higher prediction success rate (an improvement of 7.6 percentage points: 55.1% → 62.7%). Notably, the improvements are most pronounced for targets that are typically challenging for AlphaFold2, including ribosomal proteins, membrane proteins, and orphan proteins. Consistent results were obtained on a completely independent test set of 167 recently released protein structures, where i-Fold again exhibited a higher prediction success rate than AlphaFold2 (an improvement of 6.0 percentage points: 43.7% → 49.7%). Our findings indicate that explicit integration of RIS can advance the state of the art in protein structure prediction, producing more accurate and generalizable models without substantially increasing computational cost.
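The abstract does not specify how RIS enter the training objective, so the following is only a minimal sketch of one plausible reading of "positional weights": a per-residue error averaged with weights proportional to each residue's importance score. The function name `ris_weighted_loss`, the use of a generic per-residue error, and the normalization of weights to sum to one are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence


def ris_weighted_loss(per_residue_error: Sequence[float],
                      ris: Sequence[float]) -> float:
    """Importance-weighted mean of per-residue errors (illustrative only).

    per_residue_error: one non-negative error per residue (hypothetical;
        e.g. a squared deviation between predicted and reference positions).
    ris: one non-negative importance score per residue (hypothetical
        stand-in for the ESM-derived RIS described in the abstract).
    """
    if len(per_residue_error) != len(ris):
        raise ValueError("per_residue_error and ris must have equal length")
    total = sum(ris)
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    # Assumption: weights are normalized to sum to 1, so uniform scores
    # recover the ordinary unweighted mean error.
    return sum(e * s / total for e, s in zip(per_residue_error, ris))
```

Under this sketch, uniform scores reduce to the plain mean, while raising the score of a poorly predicted residue raises the loss, so errors at high-importance positions are penalized more heavily.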