Beyond profiles: supervised repeat annotation using protein embeddings

Abstract

Repeated sequence motifs give rise to diverse protein structures and functions, yet their detection is often challenging due to weak sequence similarity between repeat units. Most sensitive approaches rely on homology information directly represented by alignments or derived profiles, which limits their flexibility and scalability. Here, we introduce TREAD (Transfer learning-based REpeat Annotation using Protein EmbeDdings), a supervised framework that reformulates repeat detection as a residue-level annotation problem and operates directly on embeddings from protein language models. Instead of constructing explicit probabilistic profiles, TREAD learns repeat-specific features implicitly, enabling residue-resolved scoring and flexible repeat segment extraction. Across complementary benchmarks based on RepeatsDB and Pfam, TREAD consistently matches or outperforms the widely used profile-based tool HMMER, particularly in low-data and high-divergence settings. The model is robust to the choice of score threshold and generalizes better to independent test sets, including those containing remote homologs. To illustrate the practical utility of TREAD, we applied it to survey β-propeller proteins across the AlphaFold Database and representative proteomes, generating a comprehensive census of this fold. This analysis highlights extensive propeller diversity, identifies lineage-specific expansion patterns across the tree of life, and suggests previously unrecognized relationships between propellers and other repeat folds. Overall, TREAD provides a flexible and scalable alternative to profile-based repeat annotation and establishes a general motif-centric framework for protein sequence annotation.
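
The abstract describes residue-resolved scoring followed by repeat segment extraction but does not specify the extraction procedure. As a rough illustration only, the sketch below shows one common way to turn per-residue scores into segments: threshold the scores, merge runs separated by short gaps, and discard short segments. The function name `extract_segments` and the parameters `threshold`, `min_length`, and `max_gap` are hypothetical and are not taken from TREAD; the scores are made-up values standing in for a model's output.

```python
from typing import List, Optional, Tuple


def extract_segments(
    scores: List[float],
    threshold: float = 0.5,
    min_length: int = 3,
    max_gap: int = 1,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) residue ranges built from per-residue scores.

    Hypothetical helper: threshold, min_length, and max_gap are illustrative
    assumptions, not values or parameters from TREAD.
    """
    # Step 1: collect maximal runs of residues scoring at or above the threshold.
    runs: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, score in enumerate(scores):
        if score >= threshold:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(scores)))

    # Step 2: merge runs separated by at most max_gap low-scoring residues.
    merged: List[Tuple[int, int]] = []
    for run in runs:
        if merged and run[0] - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], run[1])
        else:
            merged.append(run)

    # Step 3: drop segments shorter than min_length residues.
    return [(a, b) for a, b in merged if b - a >= min_length]


# Made-up per-residue scores for an 11-residue toy sequence.
toy_scores = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.1, 0.1, 0.6, 0.1]
segments = extract_segments(toy_scores)
print(segments)
```

In this toy input, the single low-scoring residue at index 3 is bridged because it falls within `max_gap`, while the isolated high score at index 9 is dropped by the `min_length` filter. A post-processing step of this kind is one way a residue-level annotator could expose tunable segment boundaries; whether TREAD works this way is not stated in the abstract.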
