Compression of protein secondary structures enables ultra-fast and accurate structure searching

Abstract

Protein structure prediction has undergone a revolution with the advent of AI-based algorithms such as AlphaFold and RoseTTAFold. As a result, over 200 million predicted protein structures have been published. This wealth of structural data has created a need for rapid structure comparison algorithms, such as Foldseek, that enable efficient searches across this vast space of protein structures. Here we introduce a new ultra-compact representation of protein structure in the form of Secondary Structure Elements (SSEs). These are short sequences, with around 8% of the length and 10% of the information content of full amino acid sequences and 3Di sequences. We show that, despite this degree of compression, SSEs can be used as a highly effective tertiary structure comparison tool, with accuracy approaching that of Foldseek while offering a 200-fold speedup. In addition, SSEs offer performance comparable to Foldseek in domain boundary retrieval. Furthermore, we show that the particular way in which SSEs encode structure can also be used to specifically detect proteins that differ due to conformational change. These findings demonstrate that SSEs offer a valuable complementary approach for protein structure characterisation.
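
The abstract does not say how SSE sequences are built, so the following is only an illustrative sketch of the general idea of compressing a structure into element-level symbols. It assumes a per-residue three-state string (H for helix, E for strand, C for coil), such as one reduced from DSSP output, and collapses each run of identical states into a single symbol. The function name `compress_to_sse` and the three-letter alphabet are hypothetical and are not the authors' encoding.

```python
from itertools import groupby


def compress_to_sse(per_residue: str) -> str:
    """Collapse each run of identical per-residue states into one symbol.

    Assumption: `per_residue` is a string with one secondary-structure
    state per residue (e.g. H, E, C). This is an illustration only,
    not the encoding used in the article.
    """
    # groupby yields one (state, run) pair per maximal run of equal characters
    return "".join(state for state, _run in groupby(per_residue))


# Hypothetical 27-residue assignment: coil, helix, coil, strand, coil, strand, coil
per_residue = "CCHHHHHHHHCCEEEEECCCEEEEECC"
sse = compress_to_sse(per_residue)
print(sse)                           # CHCECEC
print(len(sse) / len(per_residue))   # fraction of the original length
```

A run-collapsed string like this discards element lengths, so a practical encoding would likely need to retain more information than this sketch does.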
