Protein structure alignment significance is often exaggerated

Robert C. Edgar
Harutyun Sahakyan


Abstract

Machine learning has generated millions of high-quality predicted protein structures, creating a need for computationally efficient structure search algorithms and robust estimates of statistical significance at this scale. We show that unrelated proteins have a universal tendency towards convergent evolution of secondary and tertiary motifs, causing an excess of high-scoring false positive alignments. We investigate popular structure search and alignment algorithms, finding that previous methods routinely overestimate significance by up to six orders of magnitude. To address these issues, and to accommodate recent innovations in search algorithm design, we describe a novel method for estimating statistical significance. We show that its E-values are accurate, scale successfully with database size, and are robust against the (generally unknown) diversity of folds in the database.

We implement our approach in an online structure search service based on Reseek at https://reseek.online.
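
To make the terms in the abstract concrete, the sketch below uses the textbook meaning of an E-value: the expected number of unrelated hits scoring at or above a threshold, which grows in proportion to the number of database entries searched. It is not the estimation method described in the preprint, which the abstract does not specify. The function names, the add-one smoothing, and the idea of calibrating against a sample of scores from known-unrelated pairs are all assumptions made for illustration.

```python
import math
from bisect import bisect_left


def empirical_e_value(score, null_scores, database_size):
    """Expected number of unrelated hits scoring >= `score` in a search
    against `database_size` targets, estimated from `null_scores`
    (hypothetical scores of alignments between known-unrelated pairs)."""
    ordered = sorted(null_scores)
    n_at_or_above = len(ordered) - bisect_left(ordered, score)
    # Add-one smoothing (an assumption): a score above every null sample
    # should not yield a per-comparison probability of exactly zero.
    p_per_comparison = (n_at_or_above + 1) / (len(ordered) + 1)
    return database_size * p_per_comparison


def orders_of_magnitude_off(reported_e, empirical_e):
    """Positive values mean the reported E-value is too small, i.e. the
    reported significance is exaggerated by that many powers of ten."""
    return math.log10(empirical_e / reported_e)
```

Under this definition, a hit reported with an E-value of 1e-6 whose score is in fact reached about once per search by unrelated proteins would be off by six orders of magnitude, the scale of overestimate the abstract reports for earlier methods.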

Version published to 10.1101/2025.07.17.665375 on bioRxiv
Jul 19, 2025

