Protein Embeddings and Local Alignments
Abstract
Background
The advent of protein embeddings has revolutionized bioinformatics by providing contextual representations that capture functional and evolutionary patterns. Alongside sequence alignments, they have become a cornerstone of the field. While embeddings cannot replace alignments, they can greatly improve alignment quality. Our goal is to replace the BLOSUM matrices, the decades-old standard scoring system for protein alignment, with an embedding-based scoring method.
Results
We introduce a new scoring function and algorithm for local alignment of protein sequences, together with a new comprehensive framework for evaluating local alignments. The score between two residues is given by the cosine similarity of their Ankh embedding vectors, and the algorithm uses dynamic programming with affine gap penalties. For the evaluation, we built multiple datasets, using both natural and inserted sequences, from the Conserved Domain Database, BAliBASE, and GPCRdb; designed a new algorithm for local alignment extraction, localization, and quality evaluation; and employed five distance metrics to measure similarity to the true alignment. We performed nearly one and a half million tests to compare the new algorithm with the best BLOSUM matrices, the specialized GPCRtm matrices, and top programs such as PEbA, ProtT5-score, DEDAL, vcMSA, and pLM-BLAST. Regarding the protein embedding models, Ankh not only surpasses the best combination of ProtT5 and ESM2 but also appears to better understand the “language” of proteins, as it performs much better on natural sequences than on artificial ones obtained by inserting domains into random protein sequences. Moreover, while ProtT5 and ESM2 combine to produce better results, Ankh does not combine well with other embeddings.
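To make the scoring idea concrete, the following is a minimal sketch of local alignment driven by cosine similarity with affine gap penalties, written as a textbook Smith-Waterman recurrence in the Gotoh formulation. It is not the authors' implementation. The function names, the gap penalty values, and the assumption that per-residue embeddings are already available as NumPy arrays (one row per residue) are all illustrative; any rescaling of the cosine scores that the actual method may apply is not modeled.

```python
import numpy as np

def cosine_score_matrix(emb_a, emb_b):
    """Pairwise cosine similarity between per-residue embedding rows.

    emb_a has shape (n, d) and emb_b has shape (m, d); the result is (n, m).
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return a @ b.T

def local_align_score(scores, gap_open=1.0, gap_extend=0.1):
    """Best local alignment score under affine gap penalties.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Both penalty values are hypothetical defaults chosen for illustration.
    """
    n, m = scores.shape
    H = np.zeros((n + 1, m + 1))           # best score of an alignment ending at (i, j)
    E = np.full((n + 1, m + 1), -np.inf)   # alignment ends with a gap in the first sequence
    F = np.full((n + 1, m + 1), -np.inf)   # alignment ends with a gap in the second sequence
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either open a new gap from H or extend an existing one.
            E[i, j] = max(H[i, j - 1] - gap_open, E[i, j - 1] - gap_extend)
            F[i, j] = max(H[i - 1, j] - gap_open, F[i - 1, j] - gap_extend)
            # The 0.0 option restarts the alignment, which makes it local.
            H[i, j] = max(0.0,
                          H[i - 1, j - 1] + scores[i - 1, j - 1],
                          E[i, j],
                          F[i, j])
            best = max(best, H[i, j])
    return best

# Toy stand-in for real embeddings: three mutually orthogonal "residues".
toy = np.eye(3)
print(local_align_score(cosine_score_matrix(toy, toy)))
```

In practice the arrays would hold embeddings produced by a protein language model, and a traceback step would be added to recover the aligned positions rather than only the score.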
Conclusions
The new Ankh-score-based program is vastly superior to the BLOSUM matrices and clearly superior to all existing methods. The new insights into protein embeddings can guide future improvements. To facilitate their use, the new method and protocol are freely available as a web server at e-score.csd.uwo.ca and as source code at github.com/lucian-ilie/E-score.