CombinGym: a benchmark platform for machine learning-assisted design of combinatorial protein variants
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Combinatorial mutagenesis is essential for exploring protein sequence-function landscapes in engineering applications. However, while large-scale machine learning benchmarks exist for protein function prediction, they are primarily limited to single-mutant libraries, leaving a critical gap for combinatorial mutagenesis. Here we introduce CombinGym, a benchmarking platform featuring 14 curated combinatorial mutagenesis datasets spanning 9 proteins with diverse functional properties including binding affinity, fluorescence, and enzymatic activities. We evaluated nine machine learning algorithms from five methodological categories (alignment-based, protein language, structure-based, sequence-label, and substitution-based) across multiple prediction tasks, assessing both zero-shot and supervised learning performance using Spearman’s ρ and Normalized Discounted Cumulative Gain metrics. Our analysis reveals the substantial impact of measurement noise and data processing strategies on model performance. By implementing hierarchical dataset splits (0-vs-rest, 1-vs-rest, 2-vs-rest, and 3-vs-rest scenarios), we demonstrate the value of lower-order mutation data for empowering machine learning models to predict higher-order mutant properties. We validated this capacity through both in silico simulation (improving fluorescence brightness of an oxygen-independent fluorescent protein) and experimental validation (engineering enzyme substrate specificity), achieving a substantial increase in specific activity. All datasets, benchmarks, and metrics are available through an interactive website ( https://www.combingym.org ), facilitating collaborative dataset expansion and model development through integration with automated biofoundry platforms.