MULAN: multimodal protein language model for sequence and structure encoding

Abstract

Motivation

Most protein language models (PLMs) produce high-quality representations using only protein sequences. However, incorporating known protein structures is important for many prediction tasks, leading to increased interest in structure-aware PLMs. Currently, structure-aware PLMs are either trained from scratch or incur significant parameter overhead from a dedicated structure encoder.

Results

In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN combines a pre-trained sequence encoder with a newly introduced, parameter-efficient Structure Adapter; the two are fused and trained together. Based on the evaluation of nine downstream tasks, MULAN models of various sizes show a quality improvement over both the sequence-only ESM2 and the structure-aware SaProt. The largest improvements are observed for protein–protein interaction prediction (up to 0.12 AUROC). Importantly, unlike other models, MULAN offers a cheap way to increase the structural awareness of protein representations, since it fine-tunes existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of protein structure.
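
To illustrate the fusion idea described above, the following minimal PyTorch sketch shows one way a pre-trained sequence encoder's token embeddings could be combined with the output of a small angle-based adapter. This is a hypothetical illustration, not the authors' implementation (see the repository below): the module names, the choice of seven backbone/side-chain angles, the bottleneck size, and fusion by simple addition are all assumptions, and a plain Transformer stands in for the actual pre-trained ESM2 encoder.

# Hypothetical sketch of sequence-structure fusion; not the authors' code.
import torch
import torch.nn as nn

class StructureAdapter(nn.Module):
    # Parameter-efficient adapter mapping per-residue angles (radians)
    # into the encoder's hidden space via a small bottleneck.
    def __init__(self, n_angles: int = 7, hidden_dim: int = 1280, bottleneck: int = 64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2 * n_angles, bottleneck),  # sin/cos features for each angle
            nn.GELU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, angles: torch.Tensor) -> torch.Tensor:
        # angles: (batch, seq_len, n_angles)
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.proj(feats)  # (batch, seq_len, hidden_dim)

class MulanLikeModel(nn.Module):
    # Toy stand-in: a "pre-trained" sequence encoder (here a plain Transformer,
    # used as a placeholder for ESM2) whose token embeddings are fused with
    # adapter outputs by addition before encoding.
    def __init__(self, vocab_size: int = 33, hidden_dim: int = 1280, n_angles: int = 7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.adapter = StructureAdapter(n_angles, hidden_dim)

    def forward(self, tokens: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        fused = self.embed(tokens) + self.adapter(angles)  # inject structural signal
        return self.encoder(fused)  # per-residue representations

# Usage example with random inputs.
tokens = torch.randint(0, 33, (1, 50))
angles = torch.rand(1, 50, 7) * 2 * torch.pi - torch.pi
reprs = MulanLikeModel()(tokens, angles)
print(reprs.shape)  # torch.Size([1, 50, 1280])

Because only the adapter introduces new parameters while the sequence encoder is reused, this kind of design keeps the extra training cost low relative to training a structure-aware model from scratch.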

Availability and implementation

The implementation, training data, and model checkpoints are available at https://github.com/DFrolova/MULAN.
