MULAN: multimodal protein language model for sequence and structure encoding
Abstract
Motivation
Most protein language models (PLMs) produce high-quality representations using only protein sequences. However, incorporating known protein structures is important for many prediction tasks, leading to increased interest in structure-aware PLMs. Currently, structure-aware PLMs are either trained from scratch or add significant parameter overhead for the structure encoder.
Results
In this study, we propose MULAN, a MULtimodal PLM for both sequence and ANgle-based structure encoding. MULAN combines a pre-trained sequence encoder with a newly introduced, parameter-efficient Structure Adapter; the two components are fused and trained together. Evaluated on nine downstream tasks, MULAN models of various sizes show a quality improvement over both the sequence-only ESM2 and the structure-aware SaProt. The largest gains appear in protein–protein interaction prediction (up to 0.12 AUROC). Importantly, unlike other models, MULAN offers a cheap increase in the structural awareness of protein representations because it fine-tunes existing PLMs instead of training from scratch. We perform a detailed analysis of the proposed model and demonstrate its awareness of protein structure.
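To make the adapter-based fusion concrete, the following is a minimal, illustrative sketch (not the authors' implementation): a small adapter projects per-residue backbone torsion angles into the hidden space of a sequence encoder and is fused with the token embeddings by residual addition. The module and parameter names (StructureAdapter, n_angles, bottleneck) and the stand-in Transformer trunk are assumptions for illustration only; in MULAN the trunk would be a pre-trained PLM such as ESM2.

```python
import torch
import torch.nn as nn

class StructureAdapter(nn.Module):
    """Maps per-residue backbone angles into the encoder's hidden space (illustrative)."""
    def __init__(self, n_angles: int = 4, hidden_dim: int = 320, bottleneck: int = 64):
        super().__init__()
        # sin/cos features keep the angular input continuous and periodic
        self.proj = nn.Sequential(
            nn.Linear(2 * n_angles, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, hidden_dim),
        )

    def forward(self, angles: torch.Tensor) -> torch.Tensor:
        # angles: (batch, seq_len, n_angles), in radians
        feats = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return self.proj(feats)

class StructureAwareEncoder(nn.Module):
    """Adds adapter output to token embeddings before the (pre-trained) encoder stack."""
    def __init__(self, vocab_size: int = 33, hidden_dim: int = 320, n_layers: int = 6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)      # stand-in for a PLM's embedding table
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)  # stand-in for a pre-trained PLM trunk
        self.adapter = StructureAdapter(hidden_dim=hidden_dim)

    def forward(self, tokens: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        fused = self.embed(tokens) + self.adapter(angles)      # residual fusion of sequence and structure
        return self.encoder(fused)                             # per-residue structure-aware representations

# Example usage on random inputs
model = StructureAwareEncoder()
tokens = torch.randint(0, 33, (2, 50))          # batch of 2 sequences, 50 residues each
angles = torch.rand(2, 50, 4) * 2 * torch.pi    # backbone torsion angles in radians
reprs = model(tokens, angles)                   # shape: (2, 50, 320)
```

Because only the adapter introduces new parameters while the sequence encoder is reused, this style of fusion keeps the added training cost small relative to training a structure-aware PLM from scratch.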
Availability and implementation
The implementation, training data, and model checkpoints are available at https://github.com/DFrolova/MULAN.