Efficient Protein Engineering via Integrated Language Models and Bayesian Optimization
Abstract
This study investigates whether predictive models can reduce the cost and effort of protein engineering campaigns. We explore the use of protein language models (PLMs), a variant of large language models (LLMs), to predict functional performance from protein sequences. A common challenge in this domain is the scarcity of functional data, so we examine zero-shot and few-shot learning methods. A second challenge is searching the vast fitness landscape efficiently for superior protein variants, for which we evaluate search methods such as Bayesian optimization. The proposed methods are evaluated on a benchmark of 34 protein datasets, each containing sequences and their quantified functional values. Our findings demonstrate the potential of these models to streamline and accelerate the protein engineering process.
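The abstract names Bayesian optimization as one search method but gives no implementation details. As a rough illustration only, the sketch below runs a Bayesian optimization loop over a tiny toy sequence space. It assumes a Gaussian-process surrogate with a Hamming-distance kernel and an upper-confidence-bound acquisition rule. None of these choices is taken from the paper, and the names `toy_fitness`, `kernel`, `gp_posterior`, and `bayes_opt`, the four-letter alphabet, and all hyperparameters are hypothetical. In a real campaign, `toy_fitness` would be replaced by an assay measurement, and the surrogate could be built on PLM-derived features instead of raw sequence distance.

```python
import itertools
import math
import random

ALPHABET = "ACDE"  # toy 4-letter alphabet (assumption, not from the paper)
SEQ_LEN = 4        # toy sequence length (assumption)


def toy_fitness(seq):
    """Hypothetical stand-in for an assay: additive effects plus one pairwise term."""
    additive = sum((ALPHABET.index(aa) + 1) * (pos + 1) for pos, aa in enumerate(seq))
    epistasis = 5.0 if seq[0] == "D" and seq[3] == "A" else 0.0
    return additive / 10.0 + epistasis


def kernel(a, b, gamma=0.5):
    """Hamming-distance kernel exp(-gamma * d_H), positive definite on distinct sequences."""
    return math.exp(-gamma * sum(x != y for x, y in zip(a, b)))


def cholesky(A):
    """Lower-triangular L with L L^T = A, for a symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_lower_transposed(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_posterior(train_x, train_y, query, noise=1e-3):
    """Return (mean, std) of a constant-mean GP at each query sequence."""
    mean_y = sum(train_y) / len(train_y)
    K = [[kernel(a, b) + (noise if i == j else 0.0) for j, b in enumerate(train_x)]
         for i, a in enumerate(train_x)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, [y - mean_y for y in train_y]))
    out = []
    for q in query:
        k_star = [kernel(q, x) for x in train_x]
        mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(L, k_star)
        var = max(kernel(q, q) - sum(t * t for t in v), 0.0)
        out.append((mu, math.sqrt(var)))
    return out


def bayes_opt(n_init=5, n_rounds=15, beta=2.0, seed=0):
    """Measure n_init random variants, then pick one variant per round by UCB."""
    rng = random.Random(seed)
    candidates = ["".join(p) for p in itertools.product(ALPHABET, repeat=SEQ_LEN)]
    measured = rng.sample(candidates, n_init)
    values = [toy_fitness(s) for s in measured]
    for _ in range(n_rounds):
        seen = set(measured)
        pool = [s for s in candidates if s not in seen]
        ucb = [mu + beta * sd for mu, sd in gp_posterior(measured, values, pool)]
        pick = pool[max(range(len(pool)), key=lambda i: ucb[i])]
        measured.append(pick)
        values.append(toy_fitness(pick))
    return measured, values


measured, values = bayes_opt()
best_value, best_seq = max(zip(values, measured))
print(f"best of {len(measured)} assayed variants: {best_seq} (fitness {best_value:.2f})")
```

The `beta` parameter trades exploration against exploitation: larger values favour variants the surrogate is uncertain about, smaller values favour variants it already predicts to be good. With only 20 simulated assays out of 256 candidates, the loop is not guaranteed to reach the global optimum of the toy landscape.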