Text2Protein: A Generative Model for Designated Protein Design on Given Description

Abstract

Designing protein structures from textual descriptions is a challenging problem in computational biology. We propose Text2Protein, a pipeline that combines large language models (LLMs) with diffusion models to generate full-atom protein structures from text. Using a conditional diffusion model guided by the Vicuna-7B language model, we learn the data distribution of 6D inter-residue coordinates, which are then refined into full-atom structures with PyRosetta. Trained on a curated RCSB PDB dataset, Text2Protein focuses on single-chain proteins of 40-256 residues. Extensive experiments validate Text2Protein's effectiveness: from raw text alone, it generates high-fidelity protein structures that closely resemble the ground-truth proteins. We evaluate Text2Protein with multiple metrics, including the mean squared error (MSE) of the 6D coordinates, Rosetta Energy Units (REU), and TM-score. Our results show that 5% of the generated proteins have a TM-score above 0.5, indicating the same fold as the reference in SCOP/CATH terms. Additionally, 16% of pairs have a TM-score above 0.4, 89% above 0.3, and none fall below 0.17, the level expected for unrelated protein pairs. Text2Protein thus offers a promising framework for automated protein design that could accelerate the discovery of novel proteins. This work opens new avenues for integrating natural language understanding with protein structure generation, with implications for drug discovery, enzyme engineering, and materials science.
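
To make the pipeline description concrete, here is a minimal PyTorch sketch of what text-conditioned denoising over pairwise 6D geometry could look like. It is not the authors' implementation: the module name TextConditionedDenoiser, the FiLM-style conditioning, and all shapes and hyperparameters are illustrative assumptions (the 4096-dim text embedding matches Vicuna-7B's hidden size; four channels stand in for the trRosetta-style distance and orientation maps).

```python
import torch
import torch.nn as nn

class TextConditionedDenoiser(nn.Module):
    """Sketch: predict the noise added to pairwise inter-residue geometry
    maps, conditioned on a pooled text embedding (e.g. from Vicuna-7B).
    Names, shapes, and conditioning scheme are assumptions, not the paper's."""

    def __init__(self, geom_channels: int = 4, text_dim: int = 4096, hidden: int = 128):
        super().__init__()
        self.inp = nn.Conv2d(geom_channels, hidden, 3, padding=1)
        # FiLM-style conditioning: text embedding -> per-channel scale/shift
        self.film = nn.Linear(text_dim, 2 * hidden)
        self.time = nn.Embedding(1000, hidden)  # diffusion timestep embedding
        self.body = nn.Sequential(
            nn.GELU(), nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.GELU(), nn.Conv2d(hidden, geom_channels, 3, padding=1),
        )

    def forward(self, x_t, t, text_emb):
        # x_t: (B, C, L, L) noisy inter-residue geometry at timestep t
        h = self.inp(x_t) + self.time(t)[:, :, None, None]
        scale, shift = self.film(text_emb).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.body(h)  # predicted noise, same shape as x_t

# Toy usage with random tensors
model = TextConditionedDenoiser()
x_t = torch.randn(2, 4, 64, 64)    # batch of 64-residue pairwise maps
t = torch.randint(0, 1000, (2,))   # diffusion timesteps
text = torch.randn(2, 4096)        # pooled LLM embedding
eps_hat = model(x_t, t, text)
assert eps_hat.shape == x_t.shape
```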
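
The refinement stage maps generated 6D geometry to full-atom coordinates and scores them in REU. As a rough sketch only, the snippet below runs a PyRosetta FastRelax and reports the REU of the result; the paper presumably attaches restraints derived from the generated 6D coordinates (as in the trRosetta protocol), which is omitted here, and the placeholder polyalanine sequence is purely illustrative.

```python
import pyrosetta
from pyrosetta import pose_from_sequence
from pyrosetta.rosetta.protocols.relax import FastRelax

pyrosetta.init("-mute all")

# Hypothetical: restraints derived from the generated 6D geometry would
# be attached to the pose here (e.g. via a constraint file); omitted.
pose = pose_from_sequence("A" * 64)  # placeholder 64-residue chain

scorefxn = pyrosetta.create_score_function("ref2015")
relax = FastRelax()
relax.set_scorefxn(scorefxn)
relax.apply(pose)

print("REU:", scorefxn(pose))  # Rosetta Energy Units of the relaxed pose
```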
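
The TM-score thresholds quoted above can be reproduced with any TM-align wrapper. Below is a hedged sketch using the third-party tmtools package, an assumption on my part: the paper does not name its TM-score tooling. Inputs are C-alpha coordinate arrays and matching sequences for the generated and reference structures.

```python
import numpy as np
from tmtools import tm_align  # pip install tmtools (TM-align bindings)

def tm_score(coords_gen: np.ndarray, coords_ref: np.ndarray,
             seq_gen: str, seq_ref: str) -> float:
    """TM-score of a generated structure against its reference,
    normalized by the reference length (standard convention)."""
    result = tm_align(coords_gen, coords_ref, seq_gen, seq_ref)
    return result.tm_norm_chain2  # chain 2 = reference

# Interpretation (Xu & Zhang, 2010):
#   > 0.5  : the two structures share the same fold (SCOP/CATH sense)
#   ~ 0.17 : average similarity of random, unrelated structure pairs
```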
