General-Purpose Large Language Models, such as DeepSeek V3.2, Have Evolved Protein Design Capabilities
Abstract
General-Purpose Large Language Models (GLLMs), although primarily developed for natural language processing, are increasingly demonstrating emergent capabilities in specialized scientific domains. In this study, we explored the potential of GLLMs, specifically DeepSeek V3.2 Exp in reasoning mode, to perform practical protein engineering tasks without domain-specific biological training. Two representative design problems were addressed: generation of amino acid sequences predicted to adopt the canonical 4-helix bundle topology, and targeted mutation design to improve protein solubility while preserving core structural integrity. Of 49 generated 4-helix bundle candidates, 40 adopted the desired geometry, and 36 achieved pLDDT scores above 70. Solubility optimization on 50 representative proteins yielded 46 mutants with an average increase of 0.178 in predicted solubility score, and 29 maintained structural deviations below 3 Å RMSD. These results indicate that general-purpose LLMs such as DeepSeek V3.2 can integrate sequence–structure–property relationships sufficiently to produce viable protein designs. We propose a hybrid workflow that couples GLLM-based mutation generation with established computational validation, offering an accessible route for protein and peptide engineering.
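As an illustration of the validation half of such a hybrid workflow, the following is a minimal sketch and not the authors' implementation. It assumes that an LLM has already proposed point mutations in the common `L4K` notation and that structure and solubility predictors have been run separately. The names `apply_mutation`, `Candidate`, and `passes_validation` are hypothetical. The two thresholds are the ones quoted in the abstract (pLDDT above 70, RMSD below 3 Å); requiring a positive solubility change is an added assumption.

```python
from dataclasses import dataclass

PLDDT_MIN = 70.0  # confidence threshold quoted in the abstract
RMSD_MAX = 3.0    # structural deviation threshold in Å, quoted in the abstract


def apply_mutation(sequence: str, mutation: str) -> str:
    """Apply one point mutation such as 'L4K' (1-based position) to a sequence."""
    wild, pos, new = mutation[0], int(mutation[1:-1]), mutation[-1]
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"position {pos} is outside the sequence")
    if sequence[pos - 1] != wild:
        raise ValueError(f"expected {wild} at position {pos}, found {sequence[pos - 1]}")
    return sequence[:pos - 1] + new + sequence[pos:]


@dataclass
class Candidate:
    """Predicted metrics for one design (hypothetical container)."""
    name: str
    plddt: float             # mean pLDDT from a structure predictor
    rmsd: float              # Å, mutant versus reference structure
    solubility_delta: float  # mutant score minus wild-type score


def passes_validation(c: Candidate) -> bool:
    """Keep designs that are confident, structurally close, and more soluble."""
    return c.plddt > PLDDT_MIN and c.rmsd < RMSD_MAX and c.solubility_delta > 0
```

Checking the wild-type residue before substituting guards against a frequent failure mode of generated mutation lists, namely positions that do not match the input sequence. Any metric values passed to `Candidate` must come from external prediction tools; the sketch does not compute them.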