DomDiff: protein family and domain annotation via diffusion model and ESM2 embedding


Abstract

Accurately identifying conserved protein domain boundaries and classifying the resulting domains is fundamental to genome annotation, but both tasks are hindered by ambiguous boundaries, cross-domain interference, and limited samples for rare families. Here, we present DomDiff, a supervised conditional diffusion framework that reformulates the task as a generative process. Taking ESM2 embeddings, secondary structures, and biLSTM priors as inputs, it generates labels from Gaussian noise through iterative denoising, allowing coarse-to-fine optimization. We conducted a series of benchmark analyses on publicly available protein sequence datasets, showing that DomDiff outperforms existing methods in domain boundary identification and classification, with gains of 12.6% in boundary detection and 4.2% in classification accuracy over other leading models. It performs particularly well on rare families, making it a useful tool for large-scale genome annotation and functional characterization of novel proteins, and suggesting a new approach to few-shot problems in bioinformatics.
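
To make "generates labels from Gaussian noise through iterative denoising" concrete, the following is a minimal sketch of standard DDPM-style ancestral sampling over per-residue label scores. It is not the DomDiff implementation: the abstract does not specify the noise schedule, step count, or network, so the linear schedule, the function names, and `toy_denoiser` are all assumptions made for illustration. In particular, `toy_denoiser` replaces the learned network with per-residue class hints (`hinted_labels`), standing in for what a trained model might infer from ESM2 embeddings, secondary structure, and biLSTM priors.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; the source does not state which one is used."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def toy_denoiser(x_t, alpha_bar_t, hinted_labels, num_classes):
    """Hypothetical stand-in for a learned conditional noise predictor.

    It treats the hinted class of each residue as the clean signal x0
    (+1 for the hinted class, -1 elsewhere) and returns the noise that
    would explain x_t under the forward process.
    """
    eps_hat = []
    for row, label in zip(x_t, hinted_labels):
        x0_hat = [1.0 if k == label else -1.0 for k in range(num_classes)]
        eps_hat.append([
            (x - math.sqrt(alpha_bar_t) * x0) / math.sqrt(1.0 - alpha_bar_t)
            for x, x0 in zip(row, x0_hat)
        ])
    return eps_hat


def sample_labels(hinted_labels, num_classes, num_steps=50, seed=0):
    """Run the reverse diffusion chain and return one class index per residue."""
    rng = random.Random(seed)
    betas = linear_beta_schedule(num_steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, running = [], 1.0
    for a in alphas:
        running *= a
        alpha_bars.append(running)

    # Start from pure Gaussian noise: one score vector per residue.
    x = [[rng.gauss(0.0, 1.0) for _ in range(num_classes)] for _ in hinted_labels]

    for t in reversed(range(num_steps)):
        eps_hat = toy_denoiser(x, alpha_bars[t], hinted_labels, num_classes)
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        # No noise is injected on the final step.
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0
        x = [
            [(xi - coef * ei) / math.sqrt(alphas[t]) + sigma * rng.gauss(0.0, 1.0)
             for xi, ei in zip(row, erow)]
            for row, erow in zip(x, eps_hat)
        ]

    # Decode: the highest-scoring class per residue is the predicted label.
    return [max(range(num_classes), key=lambda k: row[k]) for row in x]


if __name__ == "__main__":
    # Five residues, three hypothetical classes (e.g. background and two families).
    print(sample_labels([0, 0, 1, 1, 2], num_classes=3))
```

Because the toy denoiser already knows the clean signal, the chain simply converges to the hinted labels; in a trained model the denoiser would have to infer that signal from the conditioning inputs, which is where the coarse-to-fine refinement described in the abstract would take place.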
