Minimal Amino Acid Alphabet for Protein Design

Karel Půbal
Kseniia Kushnir
Vojtěch Spiwok
Karolína Loužecká
Vladimír Setnička
Petra Lipovová

Abstract

Proteins are built from 20 canonical amino acids. We explore whether proteins can be formed from significantly reduced amino acid alphabets. Our bioinformatics survey of UniProt (more than 250 million sequences) revealed that proteins composed of reduced amino acid alphabets (fewer than 10 amino acid types) are extremely rare among existing proteins. Next, we used computational protein design to generate proteins from all 1,013 possible alphabets of 2-10 early amino acids (Ala, Asp, Glu, Gly, Ile, Leu, Pro, Ser, Thr, and Val). All designed proteins were 100 amino acid residues long. Small amino acid alphabets favored simple helices or helix bundles, whereas larger alphabets allowed the design of more complex structures. A protein composed of 8 amino acids (Ala, Asp, Gly, Leu, Val, Ser, Thr, and Pro) was successfully verified experimentally; it adopts the β-sheet-rich fibronectin type III domain architecture. Attempts to experimentally verify designs composed of 6 and 4 amino acids were unsuccessful. In a computational experiment, without experimental validation, we show that inverse folding programs, namely ProteinMPNN, can stabilize designed proteins within the same amino acid alphabet. Our results suggest that globular proteins may have formed early in evolution. Furthermore, we show that it is possible to design proteins with interesting properties for biotechnology and synthetic biology.
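
The figure of 1,013 alphabets can be checked by counting every subset of the ten early amino acids that contains between 2 and 10 members. The sketch below is only an illustration of that arithmetic; the function name `enumerate_alphabets` is hypothetical and is not taken from the authors' pipeline.

```python
from itertools import combinations
from math import comb

# The ten early amino acids listed in the abstract.
EARLY_AMINO_ACIDS = ["Ala", "Asp", "Glu", "Gly", "Ile",
                     "Leu", "Pro", "Ser", "Thr", "Val"]


def enumerate_alphabets(residues, min_size=2, max_size=None):
    """Yield every subset of `residues` with min_size..max_size members.

    Hypothetical helper for illustration only.
    """
    if max_size is None:
        max_size = len(residues)
    for size in range(min_size, max_size + 1):
        yield from combinations(residues, size)


alphabets = list(enumerate_alphabets(EARLY_AMINO_ACIDS))

# Cross-check against the closed form: all 2**10 subsets,
# minus the empty set and the ten single-residue subsets.
closed_form = sum(comb(10, k) for k in range(2, 11))
assert len(alphabets) == closed_form == 2**10 - 1 - 10

print(len(alphabets))  # 1013
```

Single-residue alphabets and the empty set are excluded, which is why the count is 1,013 rather than 1,024.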

Arcadia Science
Mar 13, 2026

UniProt

I really like starting this with a study of the existing proteins from UniProt with reduced amino acid alphabets. I'm wondering if one could do a similar thing with proteins from the PDB, or whether many of these reduced-alphabet proteins have experimentally determined structures? If they're underrepresented in the PDB, it's possible that ESMFold would be biased towards predicting lower pLDDT scores for these proteins or even inaccurate folds, which you do somewhat mention in the discussion.
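
A survey of this kind, whether over UniProt or over PDB sequences, comes down to counting the distinct amino acid types in each sequence. The following is a minimal sketch under stated assumptions: only the 20 canonical one-letter codes are counted, the threshold of fewer than 10 types follows the abstract, and the helper names and example sequences are hypothetical. The authors' actual filtering criteria (for instance, any minimum length) may differ.

```python
# One-letter codes of the 20 canonical amino acids.
CANONICAL = frozenset("ACDEFGHIKLMNPQRSTVWY")


def alphabet_size(sequence):
    """Number of distinct canonical amino acid types in a sequence.

    Assumption: ambiguous or non-canonical codes (X, B, Z, U, O) are ignored.
    """
    return len(set(sequence.upper()) & CANONICAL)


def is_reduced(sequence, threshold=10):
    """True if the sequence uses fewer than `threshold` amino acid types."""
    size = alphabet_size(sequence)
    return 0 < size < threshold


# Made-up sequences for illustration; not real UniProt or PDB entries.
examples = {
    "eight_types": "ADGLVSTP" * 3,
    "all_twenty": "ACDEFGHIKLMNPQRSTVWY",
}

for name, seq in examples.items():
    print(name, alphabet_size(seq), is_reduced(seq))
```

The same function could be applied to sequences extracted from PDB entries to see how many reduced-alphabet proteins have experimentally determined structures.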

Arcadia Science
Mar 13, 2026

Corynebacterium pyruviciproducens

This has me wondering about the phylogenetic makeup of these proteins in both this analysis and the one with the early amino acids. Are these proteins from fairly diverse organisms or are they clustered in specific organisms? For the early amino acids, do they come from proteins with deep branches or that have been around for a long time? I think that could add just a generally interesting bit of information to this analysis, but it's also related to the protein databases used to train ESMFold, which are notoriously phylogenetically biased.

Arcadia Science
Mar 13, 2026

Stabilizing (pro-design) amino acids were those whose presence in the alphabet tends to decrease the score or increase pLDDT.

This is very cool! I wonder how generalizable it is.
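
One simple way to make the quoted definition concrete is to compare the mean pLDDT of alphabets that contain a given amino acid with the mean pLDDT of alphabets that do not. This is only a sketch of that idea: the source does not state how the authors computed the effect, the function `presence_effect` is hypothetical, and the pLDDT values below are invented toy numbers, not results from the study.

```python
from statistics import mean


def presence_effect(results, residue):
    """Mean pLDDT of alphabets containing `residue` minus mean of those without.

    `results` maps an alphabet (frozenset of residue names) to a pLDDT value.
    A positive value marks the residue as stabilizing in the quoted sense.
    Returns None when either group is empty.
    """
    with_res = [v for alpha, v in results.items() if residue in alpha]
    without = [v for alpha, v in results.items() if residue not in alpha]
    if not with_res or not without:
        return None
    return mean(with_res) - mean(without)


# Invented toy values for illustration only.
toy_plddt = {
    frozenset({"Ala", "Leu"}): 80.0,
    frozenset({"Ala", "Gly"}): 60.0,
    frozenset({"Gly", "Leu"}): 70.0,
    frozenset({"Gly", "Pro"}): 50.0,
}

for residue in ("Ala", "Gly", "Leu", "Pro"):
    print(residue, presence_effect(toy_plddt, residue))
```

Running the same comparison on designs produced with a different design tool or structure predictor would be one way to probe how well the ranking generalizes.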

Arcadia Science
Mar 13, 2026

The shape of the ECD spectrum corresponds to a β-sheet-rich protein.

Wow! Very interesting result and very cool that you were able to make this de novo protein with only 8 amino acids.

Arcadia Science
Mar 13, 2026

Discussion

Very nice discussion, covers lots of limitations and strengths of this study.

Arcadia Science
Mar 13, 2026

desing

Small typo!

Arcadia Science
Mar 13, 2026

Detailed datasets (modified lm-design, examples of raw design data, all resulting sequences and predicted 3D structure, data used to generate images, and detailed experimental procedures) are available via Zenodo (DOI: 10.5281/zenodo.18889431).

Amazing!

Arcadia Science
Mar 13, 2026

The purpose of this study is to test how many types of amino acid are needed to build a globular protein. First, we searched for such proteins among modern proteins in UniProt. Next, we attempted to computationally design and experimentally evaluate such proteins.

This is such a cool paper! I learned a ton about early proteins and amino acids. I enjoyed it all, but I found the part where the authors leveraged this analysis to identify amino acids that promoted designability particularly interesting. Also, love the experimental validation!

Version published to 10.64898/2026.03.06.710107 on bioRxiv
Mar 6, 2026
