Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences

Jeffrey A. Ruffolo
Stephen Nayfach
Joseph Gallagher
Aadyot Bhatnagar
Joel Beazer
Riffat Hussain
Jordan Russ
Jennifer Yip
Emily Hill
Martin Pacesa
Alexander J. Meeske
Peter Cameron
Ali Madani

Abstract

Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and meta-genomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate an AI-generated gene editor, denoted as OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.

Arcadia Science
Jul 12, 2024

Strikingly, the resulting landscape was dominated by generated proteins, which comprised 94.1% of the total phylogenetic diversity (as measured by cumulative branch length) and resulted in a 10.3-fold increase in diversity relative to the entire CRISPR-Cas Atlas (Fig. 2b). Novel phylogenetic groups were distributed across the tree, suggesting that the model has captured the full diversity of Cas9 and is not overfitting to any particular lineage.

I find it hard to interpret the importance of these results without more context.

For example, how surprising is it to see this enrichment given the initial n of natural and generated proteins?

How might decisions with respect to tree construction affect the branch length distribution? It seems possible that you would get a different outcome if you varied the mmseqs parameters or implemented different criteria for choosing representative proteins.

Furthermore, though novel phylogenetic groups are distributed throughout the tree, it would be interesting to know if the overall distribution across clades is predicted by the abundance of natural proteins across the tree. That is, do clades with more natural proteins in the training data tend to produce more generated proteins?
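One way to frame the "how surprising is this" question is a label-permutation null: keep the tree fixed, shuffle which tips are called natural or generated, and recompute the share of branch length that is not covered by natural tips. The sketch below does this on a toy tree. The tree, its branch lengths, the labels, and the function names are all hypothetical, and the diversity definition (sum of branch lengths on the union of tip-to-root paths) is an assumption about what "cumulative branch length" means, not the authors' stated procedure.

```python
import random

# Toy rooted tree, child -> (parent, branch_length). Values are hypothetical.
TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "C": ("n2", 2.0),
    "D": ("n2", 0.5),
    "n1": ("root", 0.5),
    "n2": ("root", 1.0),
}

# Hypothetical tip labels: one natural protein, three generated ones.
LABELS = {"A": "natural", "B": "generated", "C": "generated", "D": "generated"}


def phylo_diversity(tips, tree=TREE):
    """Sum of branch lengths on the union of paths from the tips to the root."""
    seen = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk upward until the root or an edge that was already counted.
        while node in tree and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total


def generated_share(labels, tree=TREE):
    """Fraction of total branch length not covered by natural tips."""
    natural = [t for t, lab in labels.items() if lab == "natural"]
    return 1.0 - phylo_diversity(natural, tree) / phylo_diversity(labels, tree)


def permutation_null(labels, n_perm=1000, seed=0, tree=TREE):
    """Shares obtained after shuffling labels across tips (tree held fixed)."""
    rng = random.Random(seed)
    tips = sorted(labels)
    values = [labels[t] for t in tips]
    shares = []
    for _ in range(n_perm):
        shuffled = values[:]
        rng.shuffle(shuffled)
        shares.append(generated_share(dict(zip(tips, shuffled)), tree))
    return shares


observed = generated_share(LABELS)
null = permutation_null(LABELS)
# One-sided empirical p-value: how often a random labeling does at least as well.
p_value = sum(s >= observed for s in null) / len(null)
print(f"observed share = {observed:.2f}, permutation p = {p_value:.3f}")
```

On a real tree, the same comparison could be repeated across trees built with different clustering thresholds or representative-selection rules to see how sensitive the share is to those choices.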

Arcadia Science
May 14, 2024

Table S1.

This table is pretty hard to use in its current format. Would you consider providing a text version of it, like Tables S2-S5?

Arcadia Science
May 14, 2024

These data were used to train a sequence-to-sequence gRNA model that conditionally generates crRNA and tracrRNA sequences for a given protein (Fig. 1a).

It is so cool that this worked!

Arcadia Science
May 14, 2024

After generating the full set of four million sequences, a series of filters were applied to ensure only realistic proteins were used to characterize the model’s generative capabilities.

It would be really useful to see how many sequences each of these filtering steps removed in getting you from 4 million to 2 million.
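A per-step attrition table of the kind requested here is cheap to produce if the filters are applied sequentially. The sketch below shows one way to record it. The three filters (valid residues, length bounds, maximum homopolymer run) and their thresholds are hypothetical placeholders chosen for illustration; they are not the filters described in the preprint.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def has_valid_residues(seq):
    """Keep sequences made only of the 20 standard amino acids."""
    return bool(seq) and set(seq) <= VALID_AA


def within_length(seq, lo=5, hi=12):
    """Keep sequences whose length is in [lo, hi] (toy bounds)."""
    return lo <= len(seq) <= hi


def max_run_length(seq):
    """Length of the longest run of one repeated residue."""
    longest = run = 0
    prev = None
    for ch in seq:
        run = run + 1 if ch == prev else 1
        longest = max(longest, run)
        prev = ch
    return longest


def low_repeat(seq, max_run=3):
    """Keep sequences with no homopolymer run longer than max_run."""
    return max_run_length(seq) <= max_run


def attrition(seqs, filters):
    """Apply filters in order; return (step, kept, removed) rows and survivors."""
    remaining = list(seqs)
    rows = [("input", len(remaining), 0)]
    for name, keep in filters:
        kept = [s for s in remaining if keep(s)]
        rows.append((name, len(kept), len(remaining) - len(kept)))
        remaining = kept
    return rows, remaining


# Hypothetical toy input standing in for the generated sequences.
SEQS = ["MKTAYIAK", "MKXAYIAK", "MK", "MAAAAAKT", "MKTAYIAKQR"]
FILTERS = [
    ("valid residues", has_valid_residues),
    ("length", within_length),
    ("low repeat", low_repeat),
]

rows, survivors = attrition(SEQS, FILTERS)
for step, kept, removed in rows:
    print(f"{step:15s} kept={kept:3d} removed={removed:3d}")
```

Because the counts depend on filter order, reporting them in the order the filters were actually applied makes the table easiest to interpret.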

Version published to 10.1101/2024.04.22.590591 on bioRxiv
Apr 22, 2024
