Affinity Map: Few-Shot Protein Family Classification via Prototypical Networks: Benchmarking Sequence Encoders and Episodic ESM-2 Fine-Tuning

Mohamed Deraz Nasr


Abstract

Protein family annotation is a cornerstone of computational biology, yet the acquisition of large, curated per-family corpora is laborious and often infeasible for rare families. We present Affinity Map, a meta-learning pipeline that frames protein family classification as a few-shot learning problem: given only K labelled examples from a previously unseen family, the model must correctly assign new sequences to that family. We systematically benchmark encoder quality under this episodic framework, ranging from a lightweight 1D-CNN trained from scratch, through compositional k-mer baselines, to a frozen ESM-2 protein language model and episodic LoRA fine-tuning, all evaluated under Prototypical Networks with N-way K-shot tasks sampled from the Pfam database. Evaluation on 24 held-out test families shows that: (1) a CNN ProtoNet trained from scratch reaches 71.0% at K=5; (2) a 3-mer frequency ProtoNet reaches 86.2%; (3) a frozen ESM-2 encoder reaches 88.7% at K=5; and (4) episodic LoRA fine-tuning of ESM-2 exhibits a K-dependent interaction: LoRA gains +2.5 pp over frozen ESM-2 at K=1 (p < 0.001) but underperforms frozen ESM-2 at K ≥ 2, indicating that episodic adaptation improves single-shot retrieval at the cost of multi-shot prototype quality. All pairwise CNN vs. baseline differences are statistically significant (paired Wilcoxon, p < 0.001). Real per-epoch learning curves, a named confusion matrix, PCA/UMAP embedding visualisations, and comprehensive baseline comparisons provide biologically interpretable diagnostics throughout. All code and results are publicly available.
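
To make the N-way K-shot setup concrete, the following is a minimal sketch of one Prototypical Network episode, not the authors' implementation. It assumes sequences have already been mapped to fixed-length embedding vectors by some encoder (CNN, k-mer frequencies, or ESM-2), and it uses squared Euclidean distance to class-mean prototypes, as in the original Prototypical Networks formulation; the abstract does not state which distance the paper uses. The function names, the `pool` layout, and the family labels are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors (the class prototype)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_episode(
    pool: Dict[str, List[Vector]],
    n_way: int,
    k_shot: int,
    n_query: int,
    rng: random.Random,
) -> Tuple[Dict[str, List[Vector]], List[Tuple[Vector, str]]]:
    """Draw N families, then K support and n_query query embeddings per family."""
    families = rng.sample(sorted(pool), n_way)
    support: Dict[str, List[Vector]] = {}
    query: List[Tuple[Vector, str]] = []
    for family in families:
        picked = rng.sample(pool[family], k_shot + n_query)
        support[family] = picked[:k_shot]
        query.extend((vec, family) for vec in picked[k_shot:])
    return support, query


def classify(support: Dict[str, List[Vector]], query_vec: Vector) -> str:
    """Assign a query embedding to the family with the nearest prototype."""
    prototypes = {fam: mean_vector(vecs) for fam, vecs in support.items()}
    return min(prototypes, key=lambda fam: squared_distance(query_vec, prototypes[fam]))


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (hypothetical data).
    toy_support = {
        "famA": [[0.0, 0.0], [0.2, 0.0]],
        "famB": [[1.0, 1.0], [1.0, 0.8]],
    }
    print(classify(toy_support, [0.1, 0.1]))
```

At K=1 each prototype is a single support embedding, so classification reduces to nearest-neighbour retrieval; at larger K the prototype is an average, which is one way to read the abstract's distinction between single-shot retrieval and multi-shot prototype quality.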

Version published to 10.21203/rs.3.rs-9194664/v1 on Research Square
Mar 24, 2026

