Multi-Modal Large Language Model Enables Protein Function Prediction

Mingjia Huo
Han Guo
Xingyi Cheng
Digvijay Singh
Hamidreza Rahmani
Shen Li
Philipp Gerlof
Trey Ideker
Danielle A. Grotjahn
Elizabeth Villa
Le Song
Pengtao Xie

Abstract

Predicting protein functions can greatly accelerate biological discovery and applications, and deep learning methods have recently shown great potential for this task. However, these methods predominantly predict protein functions as discrete categories, an approach that fails to capture the nuanced and complex nature of protein functions. Furthermore, existing methods require a separate model for each prediction task, which is both resource-heavy and time-consuming. Here, we present ProteinChat, a versatile, multi-modal large language model that takes a protein’s amino acid sequence as input and generates comprehensive narratives describing its function. ProteinChat is trained on over 1,500,000 (protein, prompt, answer) triplets curated from the Swiss-Prot dataset, covering diverse functions. The model can predict a wide range of protein functions within a single, unified framework. ProteinChat also supports interactive dialogues with human users, allowing for iterative refinement of predictions and deeper exploration of protein functions. Our experimental results, evaluated through both human expert assessment and automated metrics, demonstrate that ProteinChat outperforms general-purpose LLMs such as GPT-4 by more than ten-fold. In addition, ProteinChat matches or exceeds the performance of task-specific prediction models.

Version published to 10.1101/2024.08.19.608729 on bioRxiv
Aug 20, 2024

Related articles

Protein Dimension DB: A Unified Protein Repository for Representation Learning and Functional Analysis

This article has 3 authors:
1. Pitágoras de Azevedo Alves Sobrinho
2. Tetsu Sakamoto
3. Wilfredo Blanco Figuerola
This article has no evaluations
Latest version Oct 1, 2025

E1: Retrieval-Augmented Protein Encoder Models

This article has 5 authors:
1. Sarthak Jain
2. Joel Beazer
3. Jeffrey A. Ruffolo
4. Aadyot Bhatnagar
5. Ali Madani
This article has no evaluations
Latest version Nov 13, 2025

PEPE: Scalable extraction of multi-modal protein language model representations

This article has 6 authors:
1. Jahn Zhong
2. Niccolò Cardente
3. Geir Kjetil Sandve
4. Habib Bashour
5. Maria Francesca Abbate
6. Victor Greiff
This article has no evaluations
Latest version Oct 14, 2025
