Improving Biological Sequence Prediction with AlphaFold2 Representation

Zhiqian Jiang
Canh Hao Nguyen
Hiroshi Mamitsuka


Abstract

Motivation

Accurate prediction of functional sites from primary sequences is essential for elucidating biological mechanisms and advancing rational drug design. However, traditional sequence-based features are inherently unable to capture the complex structural context of proteins. AlphaFold2 (AF2) recently revolutionized protein structure prediction, raising the expectation that AF2 can serve as a feature extractor whose structure-rich representation benefits sequence-based prediction, particularly for unknown sequences.

Results

We present a novel feature-engineering paradigm that leverages a high-dimensional latent representation matrix (of size L × D, where L is the sequence length and D is the feature dimension) extracted directly from the AF2 Evoformer module. We systematically evaluated the AF2 representation against conventional sequence-based features, such as hidden Markov model profiles, using a variety of machine learning models on two structurally contrasting tasks: calpain cleavage site prediction and nucleic-acid-binding site prediction. The AF2 representation clearly and consistently outperformed conventional sequence-based features, particularly for targets with low sequence homology to the training data. Furthermore, interpretability analyses using SHapley Additive exPlanations (SHAP) and Uniform Manifold Approximation and Projection (UMAP) revealed, through feature importance ranking and visualization, what underlies the performance advantage of the AF2 representation. Overall, these empirical results confirm that the AF2 representation can effectively bridge the sequence-to-structure gap as a feature input for sequence prediction, without adding a heavy computational burden.
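
To make the L × D feature input concrete, the following is a minimal sketch of one common way a per-residue representation matrix can be turned into fixed-length vectors for a standard classifier. The sliding-window scheme, the zero padding, and the names `window_features`, `per_residue_dataset`, and `half_width` are illustrative assumptions and are not taken from the article; the toy matrix stands in for a real AF2-derived representation.

```python
from typing import List

Matrix = List[List[float]]  # L rows (residues) x D columns (features)


def window_features(rep: Matrix, center: int, half_width: int) -> List[float]:
    """Flatten the rows of `rep` in a window around `center` into one vector.

    Positions outside the sequence are zero-padded (an assumption made here),
    so every residue yields a vector of length (2 * half_width + 1) * D.
    """
    dim = len(rep[0])
    features: List[float] = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(rep):
            features.extend(rep[pos])
        else:
            features.extend([0.0] * dim)
    return features


def per_residue_dataset(rep: Matrix, half_width: int) -> List[List[float]]:
    """Build one feature vector per residue of the L x D matrix."""
    return [window_features(rep, i, half_width) for i in range(len(rep))]


# Toy stand-in for an AF2-derived representation: L = 4 residues, D = 2.
toy_rep = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
dataset = per_residue_dataset(toy_rep, half_width=1)
```

Each row of `dataset` could then be paired with a per-residue label (for example, cleavage site or not) and passed to any conventional machine learning model; the window size would be a hyperparameter to tune.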

Availability and implementation

Source code, pre-trained models, and datasets are freely available to non-commercial users at https://github.com/Lili-irtyd/Improve-biological-sequences-prediction-by-AlphaFold2.

Contact

mami@kuicr.kyoto-u.ac.jp

Version published to 10.64898/2026.04.26.720550 on bioRxiv
Apr 28, 2026

