Improving Biological Sequence Prediction with AlphaFold2 Representation
Abstract
Motivation
Accurate prediction of functional sites from primary sequences is essential for elucidating biological mechanisms and advancing rational drug design. However, traditional sequence-based features are inherently unable to capture the complex structural context of proteins. Recently, AlphaFold2 (AF2) revolutionized protein structure prediction, raising expectations that AF2 can also serve as a feature extractor, providing structure-rich representations that are useful for sequence-based prediction, particularly for unknown sequences.
Results
We present a novel feature-engineering paradigm that leverages a high-dimensional latent representation matrix (of size L × D, where L is the sequence length and D is the feature dimension) extracted directly from the AF2 Evoformer module. We systematically evaluated the AF2 representation against conventional sequence-based features, such as hidden Markov model profiles, using a variety of machine learning models on two structurally contrasting tasks: calpain cleavage site prediction and nucleic-acid-binding site prediction. The AF2 representation clearly and consistently outperformed the conventional sequence-based features, particularly for targets with low sequence homology to the training data. Furthermore, interpretability analyses using SHapley Additive exPlanations (SHAP) and Uniform Manifold Approximation and Projection (UMAP) revealed, through feature importance ranking and visualization, more of the detail behind the performance advantage of the AF2 representation. Overall, these empirical results confirm that the AF2 representation can effectively bridge the sequence-to-structure gap as a feature input for sequence-based prediction, without adding a heavy computational burden.
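To illustrate how an L × D representation matrix can be turned into per-residue inputs for a conventional machine learning model, the following is a minimal sketch and not the authors' pipeline. It assumes a sliding-window encoding with zero padding, a hypothetical helper named `window_features`, and an illustrative D = 384; random numbers stand in for a real AF2-derived matrix.

```python
import numpy as np


def window_features(rep: np.ndarray, half_window: int) -> np.ndarray:
    """Flatten a window of residues around each position (hypothetical helper).

    rep: (L, D) per-residue representation matrix.
    Returns an (L, (2 * half_window + 1) * D) matrix. Window slots that fall
    beyond the sequence ends are zero-padded.
    """
    if rep.ndim != 2:
        raise ValueError("rep must have shape (L, D)")
    n_residues = rep.shape[0]
    # Pad only along the sequence axis; np.pad fills with zeros by default.
    padded = np.pad(rep, ((half_window, half_window), (0, 0)))
    width = 2 * half_window + 1
    # Row i concatenates the representations of residues i-half_window..i+half_window.
    return np.stack([padded[i:i + width].reshape(-1) for i in range(n_residues)])


# Placeholder data standing in for a real AF2-derived matrix (assumption).
rng = np.random.default_rng(0)
L, D = 50, 384  # illustrative sizes only
rep = rng.normal(size=(L, D))

X = window_features(rep, half_window=2)
print(X.shape)  # (50, 1920)
```

Each row of `X` can then be paired with a per-residue label (for example, cleavage site or not) and passed to any standard classifier; the window size is a tunable choice in this sketch, not a value taken from the paper.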
Availability and implementation
Source code, pre-trained models, and datasets are freely available to non-commercial users at https://github.com/Lili-irtyd/Improve-biological-sequences-prediction-by-AlphaFold2 .
Contact
mami@kuicr.kyoto-u.ac.jp