Structure-aware protein sequence alignment using contrastive learning


Abstract

Protein alignment is a critical process in bioinformatics and molecular biology. Although structure-based alignment methods can achieve desirable performance, experimentally determined structures are available for only a very small fraction of the vast number of known protein sequences. Therefore, developing an efficient and effective sequence-based protein alignment method is of significant importance. In this study, we propose CLAlign, a structure-aware, sequence-based protein alignment method built on contrastive learning. Experimental results show that CLAlign outperforms the state-of-the-art methods by at least 12.5% and 24.5% on two common benchmarks, Malidup and Malisam.

Article activity feed

  1. Structure-aware protein sequence alignment using contrastive learning

    I found this study very interesting and creative! Fine-tuning the embedding space to account for structural similarity via contrastive learning seems like a wonderful idea and the results are very impressive. Here are some of my thoughts about your paper, presented in no particular order. Please feel free to take or leave any of my suggestions.

    One advantage of CLAlign compared to structural aligners is that you don't need to calculate structures. However, CLAlign's hardware requirements are probably non-trivial, since pLM embeddings have to be calculated, and the manuscript does not report them, so it's hard to know. Relatedly, no information is provided about CLAlign's speed. I think the manuscript should be expanded to include detailed runtime statistics and hardware requirements so that CLAlign can be better benchmarked against the other tools.
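    To make the suggestion concrete: even a minimal timing harness around the embedding step would give readers a useful number. This is only a sketch; `dummy_embed` is a hypothetical stand-in, since CLAlign's actual embedding API is not described in the manuscript.

    ```python
    import time

    # Hypothetical stand-in for a pLM embedding call; it just returns one
    # fixed-size vector per residue so the harness is runnable on its own.
    def dummy_embed(seq):
        return [[0.0] * 4 for _ in seq]

    def time_embedding(embed_fn, sequences):
        """Return embeddings plus wall-clock seconds for a batch of sequences."""
        start = time.perf_counter()
        embeddings = [embed_fn(s) for s in sequences]
        return embeddings, time.perf_counter() - start

    embs, secs = time_embedding(dummy_embed, ["MKV", "MKVLA"])
    print(f"{len(embs)} sequences embedded in {secs:.4f}s")
    ```

    Reporting numbers like this per benchmark, alongside the GPU/CPU used, would let readers weigh CLAlign's cost against structural aligners.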

    While Table 1 gives us an overall picture of the alignment quality, it would be nice to know the tool's strengths and weaknesses. How does it perform when sequences are distant homologs? Or when there are large length mismatches? Since embedding-based alignments are state-of-the-art, this kind of information would be broadly useful for readers.

    Figure 1 looks more like a draft than a complete figure, and without a caption it is hard to interpret.

    The performance is very impressive, and it has me curious how much further it could be improved simply by increasing the number of epochs or the size of the training dataset. Visualizing the loss curve could help contextualize the performance and help readers understand the extent to which there is room for improvement.
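    As a concrete illustration of what I mean: a smoothed per-epoch loss trace makes it immediately visible whether training has plateaued. The losses below are purely hypothetical placeholder values, not numbers from the paper.

    ```python
    def moving_average(values, window=3):
        """Smooth a per-epoch loss trace with a sliding-window mean."""
        return [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]

    # Hypothetical per-epoch contrastive losses; a flat tail would suggest
    # that additional epochs offer little further gain.
    losses = [2.1, 1.4, 1.0, 0.8, 0.75, 0.74, 0.74]
    print(moving_average(losses))
    ```

    Plotting the actual (smoothed) training and validation losses would let readers judge whether CLAlign is under-trained or has converged.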

    Small notes:

    • Throughout the manuscript, pLMs are referred to generically, without specifying which architecture is meant. But there are many different architectures, e.g. BERT-style, T5-style, autoregressive, etc. I found this confusing.

    • There are many grammatical mistakes. Consider passing the manuscript through a grammar checker.

    Final thoughts:

    Great work! I am curious to try CLAlign once it is made available.