ProteinDJ: a high-performance and modular protein design pipeline
Abstract
Leveraging artificial intelligence and deep learning to generate proteins de novo (also known as ‘synthetic proteins’) has unlocked new frontiers of protein design. Deep learning models trained on protein structures can generate novel protein designs that explore structural landscapes unseen by evolution. This approach enables the development of bespoke binders that target specific proteins and domains through new protein-protein interactions. However, binder generation can suffer from low in silico success rates, often requiring thousands of designs and hundreds of GPU hours to obtain enough hits for experimental testing. While commercial web-apps and workstation implementations are available for binder design, these are limited in both scalability and throughput. There is a lack of open-source protein design pipelines for high-performance computing (HPC) systems that can make full use of the available hardware and parallelise the workflow efficiently to generate successful binders.
Here, we present ‘ProteinDJ’, an implementation of a synthetic protein design workflow that is deployable on HPC systems using the Nextflow portable workflow management system and Apptainer containerisation. It parallelises the workload across both GPUs and CPUs, enabling the generation and testing of hundreds of designs per hour and dramatically accelerating the discovery process. ProteinDJ is designed to be modular and currently includes RoseTTAFold Diffusion (RFdiffusion) for fold generation, ProteinMPNN or Full-Atom MPNN (FAMPNN) for sequence design, and AlphaFold2 or Boltz-2 for prediction and validation of designs and binder-target interfaces, with supporting packages for structural evaluation of designs. ProteinDJ democratises protein binder design through its robust and user-friendly implementation and provides a framework for future protein design software pipelines.