Design of highly functional genome editors by modeling the universe of CRISPR-Cas sequences
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass evolutionary constraints and generate editors with optimal properties. Here, using large language models (LLMs) trained on biological diversity at scale, we demonstrate the first successful precision editing of the human genome with a programmable gene editor designed with AI. To achieve this goal, we curated a dataset of over one million CRISPR operons through systematic mining of 26 terabases of assembled genomes and meta-genomes. We demonstrate the capacity of our models by generating 4.8x the number of protein clusters across CRISPR-Cas families found in nature and tailoring single-guide RNA sequences for Cas9-like effector proteins. Several of the generated gene editors show comparable or improved activity and specificity relative to SpCas9, the prototypical gene editing effector, while being 400 mutations away in sequence. Finally, we demonstrate an AI-generated gene editor, denoted as OpenCRISPR-1, exhibits compatibility with base editing. We release OpenCRISPR-1 publicly to facilitate broad, ethical usage across research and commercial applications.
Article activity feed
-
Strikingly, the resulting landscape was dominated by generated proteins, which comprised 94.1% of the total phylogenetic diversity (as measured by cumulative branch length) and resulted in a 10.3-fold increase in diversity relative to the entire CRISPR-Cas Atlas (Fig. 2b). Novel phylogenetic groups were distributed across the tree, suggesting that the model has captured the full diversity of Cas9 and is not overfitting to any particular lineage.
I find it hard to interpret the importance of these results without more context.
For example, how surprising is it to see this enrichment given the initial n of natural and generated proteins?
How might decisions about tree construction affect the branch length distribution? It seems possible that you would get a different outcome if you varied the mmseqs parameters or implemented different criteria for choosing representative proteins.
Furthermore, although novel phylogenetic groups are distributed throughout the tree, it would be interesting to know whether the overall distribution across clades is predicted by the abundance of natural proteins across the tree. That is, do clades with more natural proteins in the training data tend to produce more generated proteins?
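One possible way to examine the clade-level question above is a rank correlation between the number of natural and generated proteins per clade. The sketch below is only an illustration under stated assumptions: the clade labels, the `(clade, origin)` tip format, and the toy counts are hypothetical and do not come from the preprint, and Spearman correlation is just one reasonable choice of statistic.

```python
from collections import defaultdict


def average_ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def clade_counts(tips):
    """Tally tips per clade; each tip is (clade_id, 'natural' or 'generated')."""
    counts = defaultdict(lambda: {"natural": 0, "generated": 0})
    for clade, origin in tips:
        counts[clade][origin] += 1
    return dict(counts)


# Hypothetical toy data: three clades with made-up tip counts.
toy_tips = (
    [("clade_A", "natural")] * 5 + [("clade_A", "generated")] * 40
    + [("clade_B", "natural")] * 2 + [("clade_B", "generated")] * 12
    + [("clade_C", "natural")] * 9 + [("clade_C", "generated")] * 55
)
counts = clade_counts(toy_tips)
clades = sorted(counts)
rho = spearman(
    [counts[c]["natural"] for c in clades],
    [counts[c]["generated"] for c in clades],
)
print(f"Spearman rho across {len(clades)} clades: {rho:.2f}")
```

A rho near 1 would suggest that generation largely tracks the abundance of natural proteins in the training data, while a weak correlation would suggest that the model populates clades more evenly than the training data would predict.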
-
Table S1.
This table is pretty hard to use in its current format. Would you consider providing a text version of it, like Tables S2-S5?
-
These data were used to train a sequence-to-sequence gRNA model that conditionally generates crRNA and tracrRNA sequences for a given protein (Fig. 1a).
This is so cool that this worked!
-
After generating the full set of four million sequences, a series of filters were applied to ensure only realistic proteins were used to characterize the model’s generative capabilities.
It would be really useful to see how many sequences each of these filtering steps removed in getting from 4 million to 2 million.
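As a rough illustration of the kind of per-step accounting requested here, the sketch below applies a sequence of filters in order and records how many sequences each one removes. The filter names, their criteria, and the toy sequences are all hypothetical placeholders; the preprint's actual filters are not reproduced here.

```python
def run_filter_funnel(sequences, filters):
    """Apply (name, predicate) filters in order.

    Returns the surviving sequences and a report with one tuple per step:
    (name, count_in, count_removed, count_out).
    """
    remaining = list(sequences)
    report = []
    for name, keep in filters:
        kept = [s for s in remaining if keep(s)]
        report.append((name, len(remaining), len(remaining) - len(kept), len(kept)))
        remaining = kept
    return remaining, report


VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical filters, chosen only to make the example concrete.
filters = [
    ("starts_with_methionine", lambda s: s.startswith("M")),
    ("canonical_residues_only", lambda s: set(s) <= VALID_AA),
    ("min_length_5", lambda s: len(s) >= 5),
]

# Made-up toy sequences standing in for generated proteins.
toy_sequences = ["MKRLA", "MKXLAA", "KRLAAA", "MKR", "MDKKYSIG"]

survivors, report = run_filter_funnel(toy_sequences, filters)
for name, n_in, n_removed, n_out in report:
    print(f"{name}: {n_in} in, {n_removed} removed, {n_out} out")
```

Because filters are applied sequentially, the count removed at each step depends on the order of the filters; reporting the order alongside the counts keeps the table interpretable.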
-