HieVi: Protein Large Language Model for proteome-based phage clustering

Abstract

Viral taxonomy is challenging because viruses recombine readily. Recent updates from the International Committee on Taxonomy of Viruses (ICTV) and advances in proteome-based clustering tools highlight the need for a unified framework that organizes bacteriophages (phages) across multiscale taxonomic ranks and extends beyond genome-based clustering. Meanwhile, self-supervised large language models trained on amino acid sequences have proven effective at capturing the structural, functional, and evolutionary properties of proteins. Building on these advances, we introduce HieVi, which uses embeddings from a protein language model to define a vector representation of each phage and to generate a hierarchical tree of phages. Using the INPHARED dataset of 24,362 complete and annotated viral genomes, we show that a multiscale taxonomic ranking emerges in HieVi that aligns well with the current ICTV taxonomy. We propose that this method, unique in its integration of protein language models for viral taxonomy, can encode phylogenetic relationships at least up to the family level. It therefore offers biologists a valuable tool for discovering and defining new phage families and for uncovering novel evolutionary connections.
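
The general idea of turning per-protein embeddings into one vector per phage and then building a hierarchical tree can be illustrated with a toy sketch. This is not the HieVi implementation: the abstract does not say how protein embeddings are aggregated or how the tree is built, so mean pooling, cosine distance, and average-linkage agglomerative clustering are all assumptions made here for illustration. The function names (`phage_vector`, `cosine_distance`, `build_tree`) and the three-dimensional toy embeddings are hypothetical; real protein language model embeddings have hundreds or thousands of dimensions.

```python
import math
from itertools import combinations


def phage_vector(protein_embeddings):
    """Mean-pool per-protein embeddings into one phage vector (assumed aggregation)."""
    n = len(protein_embeddings)
    dim = len(protein_embeddings[0])
    return [sum(e[i] for e in protein_embeddings) / n for i in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def build_tree(vectors):
    """Average-linkage agglomerative clustering (assumed linkage).

    `vectors` maps phage name -> vector. Returns the tree as nested tuples.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in vectors]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(
            cosine_distance(vectors[x], vectors[y]) for x in a[1] for y in b[1]
        )
        return total / (len(a[1]) * len(b[1]))

    while len(clusters) > 1:
        # Find and merge the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy stand-ins for protein language model embeddings (hypothetical values).
toy_proteomes = {
    "phage_A": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "phage_B": [[0.9, 0.1, 0.0], [1.0, 0.1, 0.0]],
    "phage_C": [[0.0, 0.0, 1.0], [0.0, 0.2, 0.8]],
}
phage_vectors = {name: phage_vector(e) for name, e in toy_proteomes.items()}
tree = build_tree(phage_vectors)
print(tree)
```

In this toy data, phage_A and phage_B have similar proteomes and are merged first, with phage_C joining at the root. Cutting such a tree at different depths is one way nested groupings at several scales could be read off.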
