Explaining how mutations affect AlphaFold predictions
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Transformer models, neural networks that learn context by identifying relationships in sequential data, underpin many recent advances in artificial intelligence. Nevertheless, their inner workings are difficult to explain. Here, we find that a transformer model within the AlphaFold architecture uses simple, sparse patterns of amino acids to select protein conformations. To identify these patterns, we developed a straightforward algorithm called Conformational Attention Analysis Tool (CAAT). CAAT identifies amino acid positions that affect AlphaFold’s predictions substantially when modified. These effects are corroborated by experiments in several cases. By contrast, modifying amino acids ignored by CAAT affects AlphaFold predictions less, regardless of experimental ground truth. Our results demonstrate that CAAT successfully identifies the positions of some amino acids important for protein structure prediction, narrowing the search space required to predict effective mutations and suggesting a framework that can be applied to other transformer-based neural networks.
Article activity feed
-
AF overpredicted the dimer conformation substantially
It seems pertinent to establish why the dimer conformation is predicted in XCL1. It would be valuable to run a structural alignment of both XCL1 conformations against the AF2/3 training dataset.
This would reveal several things. First, are either of the XCL1 conformations in the training dataset? If either is present, that would constitute data leakage. Second, how many hits correspond to each conformation?
My hypothesis is that either (a) the XCL1 dimer is present in the training dataset and the chemokine isn't, or (b) neither/both are present, but the dimer yields significantly more hits, creating a dimer preference for XCL1 and all of its derived "ancestors".
Depending on the dataset size (I forget how much clustering the AF folks did), the alignment could feasibly be conducted with TM-align. Otherwise, Foldseek or other scalable aligners would work.
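Once such a search has run, tallying hits per conformation is straightforward. A minimal sketch, assuming Foldseek's default BLAST-style tabular (.m8) output, where column 1 is the query, column 2 the matched target, and column 3 the fraction of identical residues (the filenames and query labels below are hypothetical):

```python
from collections import Counter

def count_hits_per_query(m8_text, min_identity=0.0):
    """Count how many database structures each query conformation matched.

    Assumes BLAST-tab columns: query, target, fraction identity, ...
    Rows below the identity threshold are ignored.
    """
    hits = Counter()
    for line in m8_text.strip().splitlines():
        fields = line.split("\t")
        query, fident = fields[0], float(fields[2])
        if fident >= min_identity:
            hits[query] += 1
    return hits

# Toy example with two hypothetical query structures (made-up target IDs)
sample = (
    "xcl1_dimer\t1abc_A\t0.92\n"
    "xcl1_dimer\t2def_B\t0.85\n"
    "xcl1_chemokine\t3ghi_A\t0.40\n"
)
print(count_hits_per_query(sample, min_identity=0.5))
# → Counter({'xcl1_dimer': 2})
```

Comparing the two counts (with and without an identity cutoff) would distinguish hypotheses (a) and (b) above.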
-
AF overpredicted the dimer conformation substantially
It might be valuable to check which conformation (if not both) were included in the original model training datasets.
-
XCL1 attention heads displayed an interaction network unique to the dimer fold (Figure 2B). Interpreted using a strategy originally suggested by the AF team (C), this network is characterized by vertical lines corresponding to interacting amino acids (Figure S5A,B).
It's interesting to see how the key residues in these attention maps interact globally with the total sequence. This feels somewhat distinct from the results of Zhang et al. on the categorical Jacobian, which picks up strong pairwise patterns between amino acids (predicting the contact map of a folded sequence). I wonder if this pattern is a unique feature of these fold-switching proteins or a general phenomenon in AlphaFold.
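For context, one sliver of the categorical-Jacobian idea can be sketched as substitute-and-rescore: for each position, swap in every alternative amino acid, re-score the sequence, and record the change in output. The scoring function below (`toy_score`) is a hypothetical stand-in, not AlphaFold; real use would differentiate or finite-difference the model's logits:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_score(seq):
    # Hypothetical stand-in for a model output; real use would query
    # the network's logits for the mutated sequence.
    return sum(ord(c) for c in seq) / len(seq)

def substitution_sensitivities(seq, score=toy_score):
    """Change in score for every single-residue substitution.

    Returns a dict keyed by (position, wild-type, mutant).
    """
    base = score(seq)
    J = {}
    for i, wt in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa == wt:
                continue
            mutant = seq[:i] + aa + seq[i + 1:]
            J[(i, wt, aa)] = score(mutant) - base
    return J

J = substitution_sensitivities("ACDG")  # 4 positions × 19 substitutions
```

The full categorical Jacobian is pairwise (how a substitution at position i shifts the output at position j), which is what lets it recover contact maps; the per-position scan above is just the diagonal flavor of that idea.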