Minimal Amino Acid Alphabet for Protein Design
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Proteins are built from 20 canonical amino acids. We explore whether proteins can be formed from significantly reduced amino acid alphabets. Our bioinformatics survey of UniProt (more than 250 M sequences) revealed that proteins composed of reduced alphabets (fewer than 10 amino acids) are extremely rare among existing proteins. Next, we used computational protein design to design proteins composed of all 1,013 possible alphabets of 2-10 early amino acids (Ala, Asp, Glu, Gly, Ile, Leu, Pro, Ser, Thr, and Val). All designed proteins were 100 amino acid residues long. Small amino acid alphabets favored simple helices or helix bundles, whereas larger alphabets allowed the design of more complex structures. A protein composed of 8 amino acids (Ala, Asp, Gly, Leu, Val, Ser, Thr, and Pro) was verified experimentally. It adopts the β-sheet-rich fibronectin type III domain architecture. Attempts to experimentally verify designs composed of 6 and 4 amino acids were unsuccessful. We show in a computational experiment, without experimental validation, that inverse folding programs, namely ProteinMPNN, can stabilize designed proteins within the same amino acid alphabet. Our results suggest that globular proteins may have formed early in evolution. Furthermore, we show that it is possible to design reduced-alphabet proteins with properties of interest for biotechnology and synthetic biology.
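The figure of 1,013 alphabets follows from taking every subset of size 2 to 10 from the ten early amino acids: 2^10 - 1 - 10 = 1,013 (all subsets, minus the empty set, minus the ten single-residue sets). The Python sketch below reproduces that count and adds a simple alphabet-size check of the kind a sequence survey could use. The function names and the use of one-letter codes are illustrative assumptions, not the authors' pipeline.

```python
from itertools import combinations

# One-letter codes for the ten early amino acids named in the abstract:
# Ala, Asp, Glu, Gly, Ile, Leu, Pro, Ser, Thr, Val.
EARLY_AMINO_ACIDS = "ADEGILPSTV"


def enumerate_alphabets(residues=EARLY_AMINO_ACIDS, min_size=2):
    """Return every subset of `residues` with at least `min_size` members."""
    return [
        "".join(subset)
        for size in range(min_size, len(residues) + 1)
        for subset in combinations(residues, size)
    ]


def alphabet_size(sequence):
    """Count distinct residue types in a sequence (hypothetical helper)."""
    return len(set(sequence.upper()))


alphabets = enumerate_alphabets()
print(len(alphabets))  # 1013
```

A survey could then flag any sequence for which `alphabet_size(sequence) < 10`, although a real analysis would also need to handle ambiguous residue codes and very short sequences.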
Article activity feed
-
UniProt
I really like starting this with a study of the existing proteins from UniProt with reduced amino acid alphabets. I'm wondering if one could do a similar thing with proteins from the PDB, or whether many of these reduced-alphabet proteins have experimentally determined structures. If they're underrepresented in the PDB, it's possible that ESMFold would be biased towards predicting lower pLDDT scores for these proteins, or even inaccurate folds, which you do somewhat mention in the discussion.
-
Corynebacterium pyruviciproducens
This has me wondering about the phylogenetic makeup of these proteins in both this analysis and the one with the early amino acids. Are these proteins from fairly diverse organisms, or are they clustered in specific organisms? For the early amino acids, do they come from proteins with deep branches or that have been around for a long time? I think that could add a generally interesting bit of information to this analysis, but it's also related to the protein databases used to train ESMFold, which are notoriously phylogenetically biased.
-
Stabilizing (pro-design) amino acids were those whose presence in the alphabet tends to decrease the score or increase pLDDT.
This is very cool! I wonder how generalizable it is.
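One simple way to probe an effect like this is to compare the mean pLDDT of designs whose alphabet contains a given amino acid with the mean of designs whose alphabet lacks it. The sketch below does so on invented toy values. The function name, the data layout, and the numbers are all hypothetical, and the preprint may well use a different statistic.

```python
from statistics import mean


def presence_effect(results, residue):
    """Mean pLDDT with `residue` in the alphabet minus mean pLDDT without it.

    `results` maps an alphabet string (one-letter codes) to a pLDDT value.
    A positive value suggests the residue is stabilizing under this crude measure.
    """
    with_res = [p for alphabet, p in results.items() if residue in alphabet]
    without_res = [p for alphabet, p in results.items() if residue not in alphabet]
    if not with_res or not without_res:
        raise ValueError("need designs both with and without the residue")
    return mean(with_res) - mean(without_res)


# Toy, invented values for illustration only (not data from the preprint).
toy_results = {"AL": 80.0, "AG": 60.0, "GL": 70.0, "AGL": 75.0}
print(presence_effect(toy_results, "L"))
```

A difference of means ignores alphabet size and interactions between residues, so it is only a first look; a regression on presence/absence indicators would be one way to separate those contributions.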
-
The shape of the ECD spectrum corresponds to a β-sheet-rich protein.
Wow! Very interesting result and very cool that you were able to make this de novo protein with only 8 amino acids.
-
Discussion
Very nice discussion, covers lots of limitations and strengths of this study.
-
desing
Small typo!
-
Detailed datasets (modified lm-design, examples of raw design data, all resulting sequences and predicted 3D structure, data used to generate images, and detailed experimental procedures) are available via Zenodo (DOI: 10.5281/zenodo.18889431).
Amazing!
-
The purpose of this study is to test how many types of amino acid are needed to build a globular protein. First, we searched for such proteins among modern proteins in UniProt. Next, we attempted to computationally design and experimentally evaluate such proteins.
This is such a cool paper! I learned a ton about early proteins and amino acids. I enjoyed it all, but I found the part where the authors leveraged this analysis to identify amino acids that promoted designability particularly interesting. Also, love the experimental validation!
-