A Multi-modal LLM for Dynamic Protein-Ligand Interactions and Generative Molecular Design
This article has been reviewed by the following groups:
- Evaluated articles (Arcadia Science)
Abstract
BioDynaGen (Biological Dynamics and Generation) is a novel multi-modal framework unifying protein sequences, dynamic binding site conformations, small molecule ligand SMILES, and natural language text into a single discrete token representation. Built upon a general large language model, BioDynaGen employs continuous pre-training and instruction fine-tuning via next-token prediction to address critical gaps in modeling protein dynamics and ligand interactions. This framework enables a diverse range of tasks, including small molecule-protein binding prediction, dynamic pocket design, and ligand-assisted functional generation. By comprehensively integrating these modalities, BioDynaGen offers an advanced framework for understanding and designing complex biological molecular interactions.
Article activity feed
-
First, generating a textual analysis of a binding site based on a protein sequence. Second, predicting a plausible binding site conformation given a specific ligand. Third, synthesizing a functional description by integrating the protein sequence, the predicted conformation, and the ligand information.
Nice breakdown tbh!
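The three-step chain quoted above could be sketched as sequential prompting, where each step's output is fed into the next. This is only a rough reading of the paper's pipeline: `generate` is a stand-in for the model call, and the prompt wording is invented for illustration.

```python
def generate(prompt: str) -> str:
    """Stand-in for the multi-modal LLM's next-token generation.
    The real model call is not part of this sketch."""
    return f"<output for: {prompt[:40]}>"

def three_step_analysis(protein_seq: str, ligand_smiles: str) -> str:
    # Step 1: textual analysis of the binding site from the sequence alone
    site_text = generate(f"Analyze the binding site of protein {protein_seq}")
    # Step 2: plausible binding-site conformation given the specific ligand
    conformation = generate(
        f"Predict a pocket conformation for ligand {ligand_smiles} given: {site_text}"
    )
    # Step 3: functional description integrating all three pieces of information
    return generate(
        f"Describe function given {protein_seq}, {conformation}, {ligand_smiles}"
    )

answer = three_step_analysis("MKTAYIAK", "CCO")
```

The interesting design question is whether the intermediate outputs (site text, conformation tokens) are re-tokenized and conditioned on autoregressively, or generated in one pass; the excerpt alone doesn't settle that.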
-
Output: A natural language answer describing the protein’s function, activity, or binding mode under the specified ligand conditions.
To what extent is this accurate / aligned with biological reality? Does generating natural language answers introduce a source of error or confounding? What happens as answers become shorter vs. longer, or less vs. more complex?
-
A central challenge is learning aligned and effective representations across different data types, such as learning effective binary descriptors that can maintain group fairness [27].
To what extent has this been solved by better molecular representations? Proteins and ligands are still molecules, and wouldn't atom-level representations ensure consistency across these data types? Boltz/BoltzGen does leverage atom-level information...
-
SE(3)-invariant encoder combined with a temporal-aware VQ-VAE style quantization module. This allows us to convert diverse binding pocket conformations (e.g., apo, holo, or intermediate states) into discrete tokens, effectively capturing their dynamic variations. Furthermore, we integrate standard SMILES string tokenization for small molecules, alongside specialized amino acid tokens and the native Llama3 text tokenizer, expanding the LLM’s vocabulary to encompass these crucial biological entities.
Why the Llama3 tokenizer among all other choices? Seems odd methodologically? Why not something designed for this kind of purpose? https://arxiv.org/html/2409.15370v1
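For readers unfamiliar with VQ-VAE-style quantization, the core step the excerpt describes (mapping continuous pocket-conformation embeddings to discrete tokens) is a nearest-neighbour lookup in a learned codebook. A minimal sketch, assuming per-conformation feature vectors from some SE(3)-invariant encoder upstream (not implemented here) and an already-learned codebook:

```python
import numpy as np

def quantize(pocket_embeddings, codebook):
    """Map continuous pocket-conformation embeddings to discrete token ids
    by nearest-neighbour lookup in a learned codebook (VQ-VAE style)."""
    # pocket_embeddings: (n_frames, d) features, one row per conformation
    # codebook: (K, d) learned code vectors; token id = index of nearest code
    dists = ((pocket_embeddings[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

# Toy example: 3 conformations (e.g. apo, holo, intermediate), 4 codes, d=2
rng = np.random.default_rng(0)
codebook = rng.normal(size=(4, 2))
# Frames lie near codes 2, 0, 2, so they should quantize to those ids
frames = codebook[[2, 0, 2]] + 0.01 * rng.normal(size=(3, 2))
tokens = quantize(frames, codebook)  # → array([2, 0, 2])
```

The resulting integer ids are what would be appended to the LLM's expanded vocabulary alongside SMILES, amino-acid, and text tokens; the "temporal-aware" part of the module presumably conditions the codebook assignment on the conformational trajectory, which this sketch omits.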