AlphaFind: Discover structure similarity across the entire known proteome

Abstract

AlphaFind is a web-based search engine that provides fast structure-based retrieval in the entire set of AlphaFold DB structures. Unlike other protein processing tools, AlphaFind is focused entirely on tertiary structure, automatically extracting the main 3D features of each protein chain and using a machine learning model to find the most similar structures. This indexing approach and the 3D feature extraction method used by AlphaFind have both demonstrated remarkable scalability to large datasets as well as to large protein structures. The web application itself has been designed with a focus on clarity and ease of use. The searcher accepts any valid UniProt ID, PDB ID or gene symbol as input, and returns a set of similar protein chains from AlphaFold DB, including various similarity metrics between the query and each of the retrieved results. In addition to the main search functionality, the application provides 3D visualizations of protein structure superpositions in order to allow researchers to instantly analyze the structural similarity of the retrieved results. The AlphaFind web application is available online for free and without any registration at https://alphafind.fi.muni.cz.

Article activity feed

  1. The search starts with a cytochrome from corn (Zea Mays), and within the first 50 hits, we find similar structures originating from various animals (fish, eagle, mouse, cat, horse, etc.)

    The phrase "within the first 50 hits" feels tantalizing. What else appeared among the top hits? Were there hits that were surprising or potentially false positives? And were there proteins that should have appeared among the top hits, but didn't?

  2. Here, AlphaFind shows us (Figure 2) that highly similar hemoglobin structures can also be found in other species.

    Again, it would be really great to quantify what "highly similar" means here.

  3. in an average of 7 seconds with negligible back-end load

    It would be helpful to mention details about the hardware here, as the time cost is hard to interpret without that information.

  4. Therefore, high occurrence of unstructured regions in the input structure can bias the search. This phenomenon is more prevalent in coiled-coil structures but can be also observed in some small structures

    Again, it would be great to quantify this and/or to discuss some examples of proteins for which this is a real problem.

  5. We tested AlphaFind on a diverse set of proteins varying in size, complexity, and quality. AlphaFind provided biologically relevant results even for small, large and lower quality structures. When AlphaFind did not offer structures with high TM-scores, the results remained biologically relevant.

    I think these claims would be more convincing if they could be quantified and if the performance of AlphaFind could be compared to other existing tools, if possible.

  6. The latter two methods in conjunction with (10) establish the basis of the indexing solution presented in here

    What is the relationship between this approach and approaches to indexing or similarity-based lookup used by common vector databases?

  7. In the offline phase, we first extract semantic information from raw cif files into vector embeddings,

    It would be helpful to explain in more detail how this is done, since it seems like a crucial step.

  8. Translating the input into a UniProt ID: AlphaFind supports three forms of input: UniProt ID, PDB ID, and Gene symbol. Since UniProt ID is internally used to identify a protein, other forms of input must be translated into UniProt ID using publicly available APIs. For PDB ID to UniProt ID conversion, we use: https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/ and Gene symbol to UniProt ID conversion AlphaFind relies on: https://rest.uniprot.org/idmapping.

    One of the main reasons I might use a structural search is if I have a novel protein that maybe isn't in UniProt. I don't know if it's possible with the way your tool works, but it could be a cool thing to think about for the future - is there a way to support user provided or determined PDBs that aren't in UniProt?

  9. Figure

    The examples are really great! It is a bit hard to see what's happening in the overlays of all the structures. It might be helpful to show the overlay of each hit protein with the input as a separate panel.

  10. To address this issue, novel searching tools have been developed, e.g., FoldSeek (6), 3D-surfer (7) or Dali server (8). However their functionality has some substantial limitations: they cannot search through the whole AlphaFold DB, and they rely on predefined fold patterns.

    I like that you brought up some of these other tools and described how your tool is different. Are there other benefits that the user might care about? For example, I noticed that the web tool is really fast! This is probably beyond the scope here, but a comparison of these different structure search tools would be useful.

  11. Limitations

    I appreciate the limitations section! I'm curious if there are plans to eventually incorporate the newer version of the AlphaFold database? Also wondering about things like the PDB database itself and the ESM metagenomic atlas?
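
As a rough illustration of the input translation step quoted in comment 8, the sketch below shows how a query might be classified and turned into a request against the two endpoints named in the article. It is not AlphaFind's implementation. The helper names (`classify_input`, `build_request`, `parse_pdbe_mapping`), the regex-based classification, the `/run` sub-path, the `Gene_Name`/`UniProtKB` database names, the `taxId` parameter, and the JSON layout of the PDBe response are all assumptions that should be checked against the PDBe and UniProt API documentation. No network call is made; the functions only describe the request and parse a hand-written sample.

```python
import re

# Endpoints quoted in the article text.
PDBE_MAPPING_URL = "https://www.ebi.ac.uk/pdbe/api/mappings/uniprot/"
UNIPROT_IDMAPPING_URL = "https://rest.uniprot.org/idmapping"

# UniProt accession pattern as given in the UniProt help pages.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Classic four-character PDB ID: a digit followed by three alphanumerics.
PDB_RE = re.compile(r"[0-9][A-Za-z0-9]{3}")


def classify_input(query: str) -> str:
    """Guess the input type (hypothetical heuristic, not AlphaFind's logic)."""
    q = query.strip()
    if UNIPROT_RE.fullmatch(q.upper()):
        return "uniprot"
    if PDB_RE.fullmatch(q):
        return "pdb"
    # Anything else is treated as a gene symbol.
    return "gene"


def build_request(query: str, tax_id: str = "9606"):
    """Describe the HTTP request that would translate `query` to a UniProt ID.

    Returns None when the query already looks like a UniProt accession.
    """
    q = query.strip()
    kind = classify_input(q)
    if kind == "uniprot":
        return None
    if kind == "pdb":
        return {"method": "GET", "url": PDBE_MAPPING_URL + q.lower()}
    # Assumed request shape for the UniProt ID-mapping service; the job
    # submitted here would still have to be polled for its results.
    return {
        "method": "POST",
        "url": UNIPROT_IDMAPPING_URL + "/run",
        "data": {"from": "Gene_Name", "to": "UniProtKB", "ids": q, "taxId": tax_id},
    }


def parse_pdbe_mapping(payload: dict, pdb_id: str) -> list:
    """Extract UniProt accessions from a PDBe-style response (assumed layout)."""
    entry = payload.get(pdb_id.lower(), {})
    return sorted(entry.get("UniProt", {}))


# Hand-written sample in the assumed response shape, not fetched from the API.
sample = {"1cbs": {"UniProt": {"P29373": {"mappings": []}}}}
print(parse_pdbe_mapping(sample, "1CBS"))
```

A sketch like this also makes the reviewer's point concrete: every path ends in a UniProt accession, so a user-determined structure with no UniProt entry has no route into the search as described.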