GTalign: spatial index-driven protein structure alignment, superposition, and search

Mindaugas Margelevičius


Listed in

Evaluated articles (Arcadia Science)

Abstract

With protein databases growing rapidly due to advances in structural and computational biology, the ability to accurately align and rapidly search protein structures has become essential for biological research. In response to the challenge posed by vast protein structure repositories, GTalign offers an innovative solution to protein structure alignment and search—an algorithm that achieves optimal superposition at high speeds. Through the design and implementation of spatial structure indexing, GTalign parallelizes all stages of superposition search across residues and protein structure pairs, yielding rapid identification of optimal superpositions. Rigorous evaluation across diverse datasets reveals GTalign as the most accurate among structure aligners while presenting orders of magnitude in speedup at state-of-the-art accuracy. GTalign’s high speed and accuracy make it useful for numerous applications, including functional inference, evolutionary analyses, protein design, and drug discovery, contributing to advancing understanding of protein structure and function.

Version published to 10.1038/s41467-024-51669-z
Aug 24, 2024
Arcadia Science
Jun 27, 2024

Section 4.17 describes the hardware configuration used for the benchmark tests, addressing the information you mentioned was missing in your previous comment. Section 4.19 details all GTalign settings, indicating that GTalign compiled with GPU support was used for all analyses unless otherwise specified. It's important to note that GTalign compiled with GPU support will not run on CPUs. The last paragraph of Section 2.2 mentions that GTalign utilized all three V100 GPUs for the SCOPe40 2.08, PDB20, and Swiss-Prot datasets, with further details provided in Supplementary Section S1. Figures 1 and 2 illustrate the results for the SCOPe40 2.08, PDB20, and Swiss-Prot datasets. Therefore, there is no ambiguity in the text and figures. The specifications for CPU usage for all other tools are detailed in Methods Sections 4.20 to 4.24. All this information is provided in the text, so please refer to the full text rather than just excerpts.

Arcadia Science
Jun 27, 2024

Figures 1 and 2 provide a compact representation of multiple results. There is no confusion as long as the figure captions clearly state what the vertical lines represent. As previously mentioned, it would be incorrect to annotate the x-axis as '# top hits with TM-score > 0.5', since the curves and lines for the tools differ: some fall within the range of TM-score > 0.5, while others are in the range of TM-score < 0.4. The left panels of Figures 1 and 2 are referenced within sections 4.19 to 4.24, which provide detailed information in the Methods section. The Results and Discussion section provides an overview of the results. The left panel illustrates the correlation between a tool's measure and alignment accuracy (Fig. 7), while the middle panel compares the tools solely based on alignment accuracy. Vertical lines indicating the threshold where the TM-score reaches 0.5 can only be drawn in the middle panel.
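The difference between the two sort orders can be made concrete with a small sketch. The data below are hypothetical toy values, not results from the paper: `tm` stands in for the TM-align-obtained TM-score of an alignment and `measure` for a tool's own ranking score, and the function names are chosen only for illustration. The sketch assumes the cumulative TM-score is a running sum over the ranked alignments.

```python
from itertools import accumulate


def cumulative_curve(alignments, key):
    """Rank alignments by `key` (descending) and return the running sum of TM-scores."""
    ranked = sorted(alignments, key=key, reverse=True)
    return list(accumulate(a["tm"] for a in ranked))


def count_at_threshold(alignments, threshold=0.5):
    """Number of alignments whose TM-score is at least `threshold`."""
    return sum(1 for a in alignments if a["tm"] >= threshold)


# Hypothetical toy alignments (illustrative values only).
alignments = [
    {"tm": 0.9, "measure": 0.7},
    {"tm": 0.3, "measure": 0.8},
    {"tm": 0.6, "measure": 0.2},
    {"tm": 0.5, "measure": 0.6},
]

# Left panel: alignments ranked by the tool's own measure.
left = cumulative_curve(alignments, key=lambda a: a["measure"])
# Middle panel: the same alignments ranked by TM-score.
middle = cumulative_curve(alignments, key=lambda a: a["tm"])
# Position of the vertical line in the middle panel.
n_above = count_at_threshold(alignments)

print(left, middle, n_above)
```

Both curves end at the same total, but only in the TM-score-sorted ranking do the first `n_above` positions contain exactly the alignments at or above the threshold. In the measure-sorted ranking, low-scoring alignments can appear early, so a single vertical line cannot separate the two groups there.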

Arcadia Science
Jun 26, 2024

4.17 says that GPU hardware is available on the server used.

4.19 says that a GPU-supported version of GTalign was used, but does not say that GTalign was run on GPUs.

While it's abundantly clear that GTalign can utilize GPUs for speed increases, as it is demonstrated in Supplementary Table S1, it's never made clear whether GTalign is using GPU in the speed tests in Figures 1 and 2, or whether a more "apples to apples" comparison (CPU-only) is used.

Here's a case where it's ambiguous whether TM-align is being parallelized on 40 CPUs or whether they are both being parallelized on 40 CPUs.

GTalign is up to 104–1424x faster (618–8454 vs. 879,965 seconds, Swiss-Prot dataset) than TM-align parallelized on all 40 CPU threads

This would be helpful:

GTalign on 3 V100 GPUs is up to 104–1424x faster (618–8454 vs. 879,965 seconds, Swiss-Prot dataset) than TM-align parallelized on all 40 CPU threads

It would be helpful if this sentence referenced Figures 1 and 2:

This feature was effectively leveraged for processing the SCOPe40 2.08, PDB20, and Swiss-Prot datasets, where GTalign exploited the computational power of all three Tesla V100 GPUs available on the system (Figures 1 & 2).
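As a quick check, the quoted speedup range can be reproduced from the runtimes in the same sentence. The sketch below uses only the numbers quoted above (618 and 8454 seconds for GTalign, 879,965 seconds for TM-align); the variable names are illustrative.

```python
# Runtimes in seconds on the Swiss-Prot dataset, as quoted in the sentence above.
tm_align_seconds = 879_965
gtalign_fastest_seconds = 618
gtalign_slowest_seconds = 8454

# Speedup is the ratio of the TM-align runtime to the GTalign runtime.
max_speedup = tm_align_seconds / gtalign_fastest_seconds
min_speedup = tm_align_seconds / gtalign_slowest_seconds

print(f"{min_speedup:.0f}x to {max_speedup:.0f}x")
```

The ratios round to 104 and 1424, matching the quoted range. The arithmetic itself does not show which hardware each runtime was measured on, which is the point of the suggested rewording.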

Arcadia Science
Jun 26, 2024

Ahh thank you for this explanation and for taking the time to respond. The left panels and middle panels are so similar (especially (c) and (d)) that I mistook the traces as identical.

I had read the caption, but it was unclear enough for me to not understand. I'm just one data point, but I think there's room for improvement. In particular, by saying "the vertical lines indicate the number of alignments with a TM-score > 0.5", you are making the x-axis do double duty: when referencing the vertical lines it is the # of hits with TM-score > 0.5, but when referencing the traces it is the # of top hits. Confusing!

Please take or leave this suggestion for increasing the readability in the caption:

(...) against the number of top alignments sorted by a tool’s measure (TM-score, Z-score, or P-value). In contrast, alignments in the middle panel are sorted by their (TM-align-obtained) TM-score. Vertical lines demarcate hits with TM-scores above (left) and below (right) 0.5.

From what I can tell, the Results & Discussion section never mentions the results in the left panel, instead deferring to the middle panel in each instance. This, coupled with the visual similarity to the traces sorted by TM-score, has me wondering just how complementary (and necessary) the left panel really is.

By the way, I really like this cumulative TM-score metric.

Arcadia Science
Jun 25, 2024

The question is not about a server or desktop, and even less about a dataset. Technological progress has reached a point where a recent desktop-grade GPU is more computationally capable than three V100 server-grade GPUs.

Arcadia Science
Jun 25, 2024

GTalign is user-friendly and easy to install from precompiled binaries. For Windows, an installer is available. For Linux, simply download the latest release and run GTalign by typing, for example, Linux_installer_GPU/bin/gtalign. It is compatible with all NVIDIA GPUs released since 2012. A conda package, which would provide additional convenience for macOS users, will be prepared in the near future.

Arcadia Science
Jun 25, 2024

Please refer to Methods section 4.17 and the first paragraph of section 4.19, which states that all analyses, unless otherwise specified, were performed using the GPU version of GTalign. All parameters are detailed in Methods section 4.19. Additionally, see the last paragraph of Section 2.2 for a discussion of runtime details.

Arcadia Science
Jun 25, 2024

(1) The middle panel shows the progression of the cumulative TM-score after the alignments are sorted by their TM-score. This differs from the left panel, where alignments are sorted by a tool's measure. Therefore, the curves are not identical and provide complementary information. (2) Labeling the x-axis as '# hits with TM-score > 0.5' is incorrect because figures 1 and 2 include a portion of top alignments with TM-scores < 0.5. This information (points 1 and 2) is detailed in the figure caption and discussed in the text. Please read all the accompanying text.

Arcadia Science
Jun 21, 2024

The right panel shows the cumulative TM-score plotted against runtime in seconds

My apologies if I missed this, but I was expecting to find a section in the Methods section that explained what hardware was used for the right panels. In particular, I was curious whether GTalign was run in CPU-only mode, or whether GPUs were used. Maybe some details could be added either as a section in the Methods section or as a quick description within the Figure 1 caption.

Arcadia Science
Jun 21, 2024

user-friendly nature

I think GTalign could be made user-friendly by creating simpler install instructions. In my opinion, that is likely the largest barrier preventing its use in the scientific world. See this issue for details: https://github.com/minmarg/gtalign_alpha/issues/1

Arcadia Science
Jun 21, 2024

Notably, the desktop-grade machine, housing a more recent and affordable GeForce RTX 4090 GPU, outpaced the server with three Tesla V100 GPU cards when running GTalign. The detailed runtimes for each GTalign parameterized variant on these diverse machines are presented in Table S5.

This is very surprising. Is there a dataset size at which the server starts to eke out performance gains?

Arcadia Science
Jun 21, 2024

In the middle panel, the alignments are sorted by their (TM-align-obtained) TM-score. Vertical lines indicate the number of alignments with a TM-score ≥ 0.5. The arrow denotes the largest difference in that number between GTalign (732,024) and Foldseek (13,371)

The middle panel presents the data in a way that I've never seen before, and I had quite a difficult time wrapping my head around it. I think my confusion boils down to these two main concerns: (1) Why are the curves in the left panels repeated in the middle panels? and (2) I think it is incorrect to label the x-axis as "# top hits". I would have understood this plot right away if the curves were removed and the x-axis label was replaced with "# hits with TM-score > 0.5".

Version published to 10.21203/rs.3.rs-3820640/v1 on Research Square
Jan 18, 2024
Version published to 10.1101/2023.12.18.572167 on bioRxiv
Dec 18, 2023
