Bi-level diversity optimisation for representative protein panel selection

Zhen Ou
Katherine James
Simon Charnock
Anil Wipat

Abstract

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
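The combination of a MaxMin and a MaxSum objective with a local search can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the two levels are combined or which moves the search uses, so the sketch assumes a lexicographic comparison (minimum pairwise distance first, sum of pairwise distances as tie-breaker) and single-swap moves. The names `panel_score` and `select_panel` and the toy distance matrix are hypothetical.

```python
from itertools import combinations


def panel_score(dist, panel):
    """Return (min pairwise distance, sum of pairwise distances) for a panel."""
    pairs = [dist[i][j] for i, j in combinations(panel, 2)]
    return (min(pairs), sum(pairs))


def select_panel(dist, k):
    """Swap-based local search for a panel of exactly k members.

    Assumption: the two levels are compared lexicographically, MaxMin first
    and MaxSum as tie-breaker. The preprint may combine them differently.
    """
    n = len(dist)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    panel = list(range(k))  # arbitrary deterministic starting panel
    best = panel_score(dist, panel)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in panel:
                    continue
                # Try replacing the member at `pos` with the candidate.
                trial = panel[:pos] + [cand] + panel[pos + 1:]
                score = panel_score(dist, trial)
                if score > best:  # tuple comparison is lexicographic
                    panel, best = trial, score
                    improved = True
    return sorted(panel), best


# Toy input: six "sequences" placed on a line; distance = absolute difference.
positions = [0, 1, 2, 10, 11, 20]
dist = [[abs(a - b) for b in positions] for a in positions]
panel, score = select_panel(dist, 3)
print(panel, score)
```

Because the search only reads a precomputed distance matrix, any sequence- or structure-based dissimilarity could be supplied as `dist`. A swap-based search of this kind stops at a local optimum, so in practice one might restart it from several initial panels.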

Version published on bioRxiv (DOI 10.64898/2026.04.17.719243), Apr 21, 2026
