Improving Biological Sequence Prediction with AlphaFold2 Representation

Abstract

Motivation

Accurate prediction of functional sites from primary sequences is essential for elucidating biological mechanisms and advancing rational drug design. However, traditional sequence-based features are inherently unable to capture the complex structural context of a protein. AlphaFold2 (AF2) has revolutionized protein structure prediction, raising the expectation that AF2 can serve as a feature extractor whose structure-rich representations are useful for sequence-based prediction, particularly for unknown sequences.

Results

We present a novel feature-engineering paradigm that leverages a high-dimensional latent representation matrix of size L × D, where L is the sequence length and D is the feature dimension, extracted directly from the AF2 Evoformer module. We systematically evaluated the AF2 representation against conventional sequence-based features, such as hidden Markov model profiles, using a variety of machine learning models on two structurally contrasting tasks: calpain cleavage site prediction and nucleic-acid-binding site prediction. The AF2 representation clearly outperformed the conventional sequence-based features across the board, particularly for targets with low sequence homology to the training data. Furthermore, interpretability analyses using SHapley Additive exPlanations (SHAP) and Uniform Manifold Approximation and Projection (UMAP) gave further insight into this performance advantage through feature-importance ranking and visualization. Overall, these empirical results confirmed that the AF2 representation can effectively bridge the sequence-to-structure gap as a feature input for sequence-based prediction, without adding a heavy computational burden.
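
To make the idea of an L × D per-residue matrix as classifier input more concrete, the following minimal Python sketch shows one common way such a matrix could be turned into fixed-size feature vectors for site-level prediction. It is an illustration only and not the authors' pipeline: the function name window_features, the window half-width, the zero padding at the sequence ends, and the random placeholder matrix standing in for a real AF2 representation are all assumptions made for this example.

```python
import numpy as np


def window_features(rep: np.ndarray, half_width: int) -> np.ndarray:
    """Convert an (L, D) per-residue matrix into windowed per-residue features.

    Hypothetical helper for illustration. Row i of the result concatenates the
    D-dimensional vectors of residues i - half_width .. i + half_width, with
    zero vectors used for positions that fall outside the sequence.
    Output shape: (L, (2 * half_width + 1) * D).
    """
    if rep.ndim != 2:
        raise ValueError("expected an (L, D) matrix")
    seq_len = rep.shape[0]
    width = 2 * half_width + 1
    # Zero-pad along the sequence axis only, so every residue has a full window.
    padded = np.pad(rep, ((half_width, half_width), (0, 0)))
    # Flatten each (width, D) window into a single feature vector.
    return np.stack([padded[i:i + width].ravel() for i in range(seq_len)])


# Placeholder data: random numbers standing in for a real L x D representation.
rng = np.random.default_rng(0)
L, D = 12, 8  # assumed toy sizes, far smaller than a real representation
rep = rng.normal(size=(L, D))

features = window_features(rep, half_width=2)
print(features.shape)  # (12, 40)
```

Each row of features could then be paired with a per-residue label (for example, cleavage site or not) and passed to any standard classifier. Windowing is only one possible design choice; a model that consumes the whole L × D matrix at once would not need it.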

Availability and implementation

Source code, pre-trained models, and datasets are freely available to non-commercial users at https://github.com/Lili-irtyd/Improve-biological-sequences-prediction-by-AlphaFold2

Contact

mami@kuicr.kyoto-u.ac.jp
