Decoupled Representation Learning Improves Generalization in CRISPR Off-Target Prediction

Abstract

Background: Computational prediction of CRISPR-Cas9 off-target activity is essential for safe guide-RNA design, yet models trained on large proxy datasets often fail to generalize to experimentally validated sites.

Methods: We present a modular two-stage deep learning framework that separates sequence representation learning from off-target classification. In Stage 1, guide RNA sequences are encoded with frozen, pretrained DNABERT embeddings learned from large genomic corpora. In Stage 2, these embeddings are combined with mismatch-level and pairwise sequence features in a hybrid CNN-Transformer classifier trained exclusively on a high-throughput proxy dataset.

Results: On the external TrueOT benchmark, a curated collection of low-throughput, experimentally confirmed off-target sites, the full model achieved a mean ROC-AUC of 0.70 ± 0.03 and a PR-AUC of 0.30 ± 0.03, markedly surpassing the proxy-only baseline (ROC-AUC = 0.64, PR-AUC = 0.22). Ablation studies confirmed that the performance gain arises from the pretrained sequence representations rather than from architectural complexity.

Conclusions: Decoupling representation learning from downstream classification and leveraging frozen transformer-based embeddings substantially improves generalization to biologically relevant off-target predictions. The proposed framework provides a reproducible baseline for CRISPR-Cas9 risk assessment and underscores the value of transfer learning for bridging proxy training data and real-world experimental results.
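The core idea of the two-stage design described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the frozen random embedding table stands in for pretrained DNABERT vectors (which would really come from a transformer with hidden size 768), the k-mer size, embedding dimension, guide sequence, and the logistic-regression head are all illustrative assumptions replacing the paper's CNN-Transformer classifier. What the sketch shows is the decoupling itself: Stage 1 parameters are fixed, and only Stage 2 parameters would be updated during training on the proxy dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 6        # k-mer size, as used by DNABERT-style tokenizers (assumption)
EMB_DIM = 32 # stand-in for the pretrained hidden size (768 in real DNABERT)

# Stage 1: a frozen embedding table stands in for pretrained DNABERT.
# In the actual framework these vectors come from the pretrained
# transformer and are never updated during Stage 2 training.
frozen_kmer_embeddings = rng.normal(size=(4 ** K, EMB_DIM))

def kmer_index(kmer: str) -> int:
    """Map a DNA k-mer to a row in the frozen embedding table."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + "ACGT".index(base)
    return idx

def encode_guide(seq: str) -> np.ndarray:
    """Stage 1: frozen, mean-pooled k-mer embedding of a guide sequence."""
    rows = [kmer_index(seq[i:i + K]) for i in range(len(seq) - K + 1)]
    return frozen_kmer_embeddings[rows].mean(axis=0)

# Stage 2: only the downstream classifier's parameters are trainable.
# A logistic-regression head replaces the paper's CNN-Transformer here.
w = np.zeros(EMB_DIM)

guide = "GACGCATAAAGATGAGACGCTGG"  # 23-nt guide+PAM, illustrative only
z = encode_guide(guide)
p_off_target = 1.0 / (1.0 + np.exp(-(w @ z)))  # 0.5 before any training
```

In a full implementation, `encode_guide` would be replaced by a forward pass through the frozen DNABERT encoder, and `w` by the hybrid classifier trained on the proxy dataset; the ablation result in the abstract suggests most of the generalization gain lives in Stage 1.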
