MULAN: multimodal protein language model for sequence and structure encoding
Abstract
Motivation
Most protein language models (PLMs) produce high-quality representations using only protein sequences. However, incorporating known protein structures is important for many prediction tasks, leading to increased interest in structure-aware PLMs. Currently, structure-aware PLMs are either trained from scratch or add significant parameter overhead for the structure encoder.
Results
In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN combines a pre-trained sequence encoder with a newly introduced, parameter-efficient Structure Adapter; the two components are fused and trained together. Evaluated on nine downstream tasks, MULAN models of various sizes show a quality improvement over both the sequence-only ESM2 and the structure-aware SaProt. The largest gains appear in protein–protein interaction prediction (up to 0.12 AUROC). Importantly, unlike other models, MULAN offers a cheap increase in the structural awareness of protein representations because it fine-tunes existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of protein structure.
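To make the adapter-based fusion concrete, the following is a minimal, illustrative sketch (not the authors' implementation): a small adapter projects per-residue backbone torsion angles into the hidden space of a sequence encoder and is fused with the token embeddings by residual addition. The module and parameter names (StructureAdapter, n_angles, bottleneck) and the stand-in Transformer trunk are assumptions for illustration only; in MULAN the trunk would be a pre-trained PLM such as ESM2.

```python
import torch
import torch.nn as nn

class StructureAdapter(nn.Module):
    """Maps per-residue backbone angles into the encoder's hidden space (illustrative)."""
    def __init__(self, n_angles: int = 4, hidden_dim: int = 320, bottleneck: int = 64):
        super().__init__()
        # sin/cos features keep the angular input continuous and periodic
        self.proj = nn.Sequential(
            nn.Linear(2 * n_angles, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, angles: torch.Tensor) -> torch.Tensor:
        # angles: (batch, seq_len, n_angles), in radians
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.proj(feats)

class StructureAwareEncoder(nn.Module):
    """Adds adapter output to token embeddings before the (pre-trained) encoder stack."""
    def __init__(self, vocab_size: int = 33, hidden_dim: int = 320, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)      # stand-in for a PLM's embedding table
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # stand-in for a pre-trained PLM trunk
        self.adapter = StructureAdapter(hidden_dim=hidden_dim)

    def forward(self, tokens: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        fused = self.embed(tokens) + self.adapter(angles)      # residual fusion of sequence and structure
        return self.encoder(fused)                             # per-residue structure-aware representations

# Example usage on random inputs
model = StructureAwareEncoder()
tokens = torch.randint(0, 33, (2, 50))          # batch of 2 sequences, 50 residues each
angles = torch.rand(2, 50, 4) * 2 * torch.pi    # backbone torsion angles in radians
reprs = model(tokens, angles)                   # shape: (2, 50, 320)
```

Because only the adapter introduces new parameters while the sequence encoder is reused, this style of fusion keeps the added training cost small relative to training a structure-aware PLM from scratch.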
Availability and implementation
The implementation, training data, and model checkpoints are available at https://github.com/DFrolova/MULAN.