Decoupled Representation Learning Improves Generalization in CRISPR Off-Target Prediction
Abstract
Background
Computational prediction of CRISPR-Cas9 off-target activity is essential for safe guide-RNA design, yet models trained on large proxy datasets often fail to generalize to experimentally validated sites.

Methods
We present a modular two-stage deep learning framework that separates sequence representation learning from off-target classification. In Stage 1, guide RNA sequences are encoded using frozen, pretrained DNABERT embeddings learned from large genomic corpora. In Stage 2, these embeddings are integrated with mismatch-level and pairwise sequence features within a hybrid CNN-Transformer classifier trained exclusively on a high-throughput proxy dataset.

Results
On the external TrueOT benchmark, a curated collection of low-throughput, experimentally confirmed off-target sites, the full model achieved a mean ROC-AUC of 0.70 ± 0.03 and a PR-AUC of 0.30 ± 0.03, markedly surpassing the proxy-only baseline (ROC-AUC = 0.64, PR-AUC = 0.22). Ablation studies confirmed that the performance gain arises from the pretrained sequence representations rather than from architectural complexity.

Conclusions
Decoupling representation learning from downstream classification and leveraging frozen transformer-based embeddings substantially improves generalization to biologically relevant off-target sites. The proposed framework provides a reproducible baseline for CRISPR-Cas9 risk assessment and underscores the value of transfer learning in bridging proxy training data and real-world experimental outcomes.
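The decoupling described in the Methods can be sketched as a feature-construction pipeline: a frozen, pretrained encoder produces sequence embeddings (Stage 1), which are concatenated with mismatch-level and pairwise features before entering the trainable classifier (Stage 2). The sketch below is purely illustrative — the tiny random embedding table stands in for DNABERT's learned representations, the dimensions are toy-sized, and all function names are ours, not the paper's:

```python
# Illustrative sketch of the two-stage, decoupled pipeline.
# Stage 1: a FROZEN embedding table stands in for pretrained DNABERT
# (the real model maps k-mer tokens to ~768-d contextual vectors).
# Stage 2: frozen embeddings are concatenated with mismatch features;
# only the downstream classifier would be trained on the proxy data.
import random

random.seed(0)
EMB_DIM = 8  # stand-in for DNABERT's hidden size
BASES = "ACGT"

# Stage 1: frozen, pretrained representation (never updated during training)
frozen_embedding = {b: [random.gauss(0, 1) for _ in range(EMB_DIM)]
                    for b in BASES}

def encode(seq):
    """Mean-pool per-base frozen embeddings into one fixed-length vector."""
    vecs = [frozen_embedding[b] for b in seq]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def mismatch_features(guide, target):
    """Mismatch-level features: total count plus positional indicators."""
    flags = [1.0 if g != t else 0.0 for g, t in zip(guide, target)]
    return [sum(flags)] + flags

def stage2_input(guide, target):
    """Concatenate frozen embeddings with pairwise mismatch features."""
    return encode(guide) + encode(target) + mismatch_features(guide, target)

x = stage2_input("ACGTACGT", "ACGTACGA")
print(len(x))  # 2*EMB_DIM + 1 + seq_len = 16 + 1 + 8 = 25
```

Keeping Stage 1 frozen means the proxy dataset can only influence the classifier's weights, which is the mechanism the ablation studies credit for the improved generalization.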