AlphaFind: Discover structure similarity across the entire known proteome

David Prochazka
Terezia Slaninakova
Jaroslav Olha
Adrian Rosinec
Katarina Gresova
Miriama Janosova
Jakub Cillik
Jana Porubska
Radka Svobodova
Vlastislav Dohnal
Matej Antol


Listed in

Evaluated articles (Arcadia Science)

Abstract

AlphaFind is a web-based search engine that provides fast structure-based retrieval in the entire set of AlphaFold DB structures. Unlike other protein processing tools, AlphaFind is focused entirely on tertiary structure, automatically extracting the main 3D features of each protein chain and using a machine learning model to find the most similar structures. This indexing approach and the 3D feature extraction method used by AlphaFind have both demonstrated remarkable scalability to large datasets as well as to large protein structures. The web application itself has been designed with a focus on clarity and ease of use. The searcher accepts any valid UniProt ID, PDB ID or gene symbol as input, and returns a set of similar protein chains from AlphaFold DB, including various similarity metrics between the query and each of the retrieved results. In addition to the main search functionality, the application provides 3D visualizations of protein structure superpositions in order to allow researchers to instantly analyze the structural similarity of the retrieved results. The AlphaFind web application is available online for free and without any registration at https://alphafind.fi.muni.cz.

Arcadia Science
Mar 29, 2024

The search starts with a cytochrome from corn (Zea Mays), and within the first 50 hits, we find similar structures originating from various animals (fish, eagle, mouse, cat, horse, etc.)

The phrase "within the first 50 hits" feels tantalizing. What else appeared among the top hits? Were there hits that were surprising or potentially false positives? And were there proteins that should have appeared among the top hits, but didn't?

Arcadia Science
Mar 29, 2024

Here, AlphaFind shows us (Figure 2) that highly similar hemoglobin structures can also be found in other species.

Again, it would be really great to quantify what "highly similar" means here.

Arcadia Science
Mar 29, 2024

in an average of 7 seconds with negligible back-end load

It would be helpful to mention details about the hardware here, as the time cost is hard to interpret without that information.

Arcadia Science
Mar 29, 2024

Therefore, high occurrence of unstructured regions in the input structure can bias the search. This phenomenon is more prevalent in coiled-coil structures but can be also observed in some small structures

Again, it would be great to quantify this and/or to discuss some examples of proteins for which this is a real problem.

Arcadia Science
Mar 29, 2024

We tested AlphaFind on a diverse set of proteins varying in size, complexity, and quality. AlphaFind provided biologically relevant results even for small, large and lower quality structures. When AlphaFind did not offer structures with high TM-scores, the results remained biologically relevant.

I think these claims would be more convincing if they could be quantified and if the performance of AlphaFind could be compared to other existing tools, if possible.
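
For readers less familiar with the TM-score mentioned in the quoted passage, the following is a minimal sketch of the standard TM-score formula evaluated for one fixed superposition. It is not taken from the preprint and does not describe how AlphaFind computes its metrics. The function names, the input format (a list of distances in ångströms between aligned residue pairs) and the 0.5 Å floor on the scale factor for short chains are assumptions made for illustration; a full TM-score additionally maximizes over superpositions, which this sketch does not do.

```python
def tm_d0(l_target: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) of the TM-score.

    The 0.5 floor for short chains is an assumption of this sketch.
    """
    if l_target <= 21:
        return 0.5
    return max(0.5, 1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8)


def tm_score(distances, l_target: int) -> float:
    """TM-score of one fixed superposition.

    distances: distances (angstroms) between aligned residue pairs.
    l_target:  length of the structure used for normalization.
    """
    d0 = tm_d0(l_target)
    # Each aligned pair contributes a value in (0, 1]; unaligned residues
    # contribute nothing but still count in the normalization.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target
```

With this definition, a perfect superposition of a 100-residue chain onto itself scores 1, and scores fall toward 0 as aligned residues drift apart or as fewer residues can be aligned, which is what makes the score convenient for quantifying a phrase such as "highly similar".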

Arcadia Science
Mar 29, 2024

The latter two methods in conjunction with (10) establish the basis of the indexing solution presented in here

What is the relationship between this approach and approaches to indexing or similarity-based lookup used by common vector databases?

Arcadia Science
Mar 29, 2024

In the offline phase, we first extract semantic information from raw cif files into vector embeddings,

It would be helpful to explain in more detail how this is done, since it seems like a crucial step.

Arcadia Science
Mar 1, 2024

Translating the input into a UniProt ID: AlphaFind supports three forms of input: UniProt ID, PDB ID, and gene symbol. Since the UniProt ID is used internally to identify a protein, other forms of input must be translated into a UniProt ID using publicly available APIs. For PDB ID to UniProt ID conversion, we use: https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/ and for gene symbol to UniProt ID conversion, AlphaFind relies on: https://rest.uniprot.org/idmapping.

One of the main reasons I might use a structural search is if I have a novel protein that maybe isn't in UniProt. I don't know if it's possible with the way your tool works, but it could be a cool thing to think about for the future - is there a way to support user provided or determined PDBs that aren't in UniProt?
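
The input translation step quoted above can be sketched roughly as follows. This is an illustrative outline only, not AlphaFind's actual code: the function names are hypothetical, the input classification is a simple heuristic based on the published UniProt accession pattern and the classic four-character PDB ID format, and the JSON layout assumed for the PDBe response (lower-case PDB ID, then a "UniProt" object keyed by accession) should be checked against the live service. The network call itself is left out so that the sketch stays self-contained; the gene symbol route through the UniProt ID mapping service is not shown.

```python
import re

# Base URL quoted in the preprint for PDB ID to UniProt ID mapping.
PDBE_MAPPING_URL = "https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/"

# UniProt accession pattern as documented by UniProt.
UNIPROT_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Classic PDB ID: one digit followed by three alphanumeric characters.
PDB_RE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def classify_input(query: str) -> str:
    """Guess the input type: 'uniprot', 'pdb', or 'gene' (heuristic)."""
    q = query.strip()
    if UNIPROT_RE.match(q.upper()):
        return "uniprot"
    if PDB_RE.match(q):
        return "pdb"
    # Anything else is treated as a gene symbol in this sketch.
    return "gene"


def pdbe_url(pdb_id: str) -> str:
    """Build the PDBe mapping URL for a PDB ID."""
    return PDBE_MAPPING_URL + pdb_id.strip().lower()


def parse_pdbe_mapping(payload: dict, pdb_id: str) -> list:
    """Extract UniProt accessions from a decoded PDBe mapping response.

    Assumed shape: {"1cbs": {"UniProt": {"P29373": {...}}}}.
    """
    entry = payload.get(pdb_id.strip().lower(), {})
    return sorted(entry.get("UniProt", {}).keys())
```

A PDB entry can map to several UniProt accessions (for example, one per chain), so the parser returns a list and leaves the choice of accession to the caller.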

Arcadia Science
Mar 1, 2024

Figure

The examples are really great! It is a bit hard to really see what's happening in the overlays of all the structures. It might be helpful to see overlays for each hit protein with the input as separate panels or something.

Arcadia Science
Mar 1, 2024

To address this issue, novel searching tools have been developed, e.g., FoldSeek (6), 3D-surfer (7) or Dali server (8). However, their functionality has some substantial limitations: they cannot search through the whole AlphaFold DB, and they rely on predefined fold patterns.

I like that you brought up some of these other tools and described how your tool is different. Are there other benefits that the user might care about? For example, I noticed that the web tool is really fast! This is probably beyond the scope here, but a comparison of these different structure search tools would be useful.

Arcadia Science
Mar 1, 2024

Limitations

I appreciate the limitations section! I'm curious if there are plans to eventually incorporate the newer version of the AlphaFold database? Also wondering about things like the PDB database itself and the ESM metagenomic atlas?

Arcadia Science
Mar 1, 2024

https://alphafind.fi.muni.cz.

I really appreciate this super easy to use and fast web application tool for finding proteins similar to an input!

Version published to 10.1101/2024.02.15.580465 on bioRxiv
Feb 18, 2024
