Adding 3Di characters to amino acid datasets can improve resolution, but the effect is weaker in shorter and alpha-helical proteins

Abstract

The recent introduction of Foldseek’s 3Di character alphabet to encode 3D protein structure has opened up new possibilities for structural phylogenetics. These characters, like protein structure, are more conserved than amino acids, raising the possibility of better resolution of very deep branches on the tree of life. As 3Di characters have a 20-letter alphabet, they are readily treatable with off-the-shelf algorithms for model-based phylogenetic inference and related methods such as bootstrapping. However, it remains to be seen whether 3Di phylogenies are broadly more resolved than sequence-based phylogenies. Using samples from nine protein superfamilies, we show that combining 3Di with sequence produces better-resolved phylogenies than either sequence or 3Di alone. We also show that information-theoretic measures, applied to superfamily alignments, correlate significantly with resolution in the phylogenies derived from these alignments. Further, we identify the proportion of alpha helices in proteins as a major factor reducing the information carried by 3Di character alignments, which explains the relatively poor performance of 3Di characters on superfamilies with highly conserved structure but high alpha-helical content. Our results provide encouragement for the further use of 3Di to address challenging questions in deep history, but also sound a note of caution about which proteins it is most suitable for.
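
The abstract does not say which information-theoretic measures were applied to the alignments. As an illustration only, the sketch below assumes one common choice, per-column Shannon entropy, and averages it over an alignment; because 3Di and amino acid alphabets both have 20 letters, the same function applies to either. The function names and the toy alignment are hypothetical and are not taken from the article.

```python
import math
from collections import Counter


def column_entropy(column, gap_chars="-."):
    """Shannon entropy (in bits) of one alignment column, ignoring gaps.

    Assumption: gaps are excluded rather than treated as a 21st state.
    """
    residues = [c for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    # H = -sum(p * log2(p)) over the observed character states
    return sum(-(k / n) * math.log2(k / n) for k in counts.values())


def mean_alignment_entropy(rows):
    """Average per-column entropy of an alignment given as equal-length strings."""
    if not rows:
        raise ValueError("alignment is empty")
    length = len(rows[0])
    if length == 0 or any(len(r) != length for r in rows):
        raise ValueError("all rows must have the same non-zero length")
    # zip(*rows) transposes the rows into columns
    entropies = [column_entropy(col) for col in zip(*rows)]
    return sum(entropies) / len(entropies)


if __name__ == "__main__":
    # Hypothetical toy alignment; real input would be a superfamily alignment
    # in either amino acid or 3Di characters.
    toy = ["ADCV", "ADCV", "AECL", "A-CL"]
    print(mean_alignment_entropy(toy))
```

A fully conserved column scores 0 bits and a column split evenly between two states scores 1 bit, so an alignment dominated by invariant columns, as might be expected for highly repetitive structural states, yields a low average.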

SIGNIFICANCE

3Di characters have been suggested as a method to generate well-resolved deep phylogenies. Our results show that 3Di characters combined with sequences can improve resolution in the deepest nodes of protein superfamily trees. However, our results also show that 3Di characters may not be suitable for all protein types.
