DNABERT2-CAMP: A Hybrid Transformer-CNN Model for E. coli Promoter Recognition

Abstract

Background: Accurate recognition of promoter sequences in Escherichia coli is fundamental to understanding gene regulation and engineering synthetic biological systems. However, existing computational methods struggle to model long-range genomic dependencies and fine-grained local motifs at the same time, particularly the degenerate −10 and −35 elements of σ70 promoters. To address this gap, we propose DNABERT2-CAMP, a hybrid deep learning framework that integrates global contextual understanding with high-resolution local motif detection for robust promoter identification.

Methods: We constructed a balanced dataset of 8720 81-bp sequences, comprising experimentally validated promoters and negative sequences, drawn from RegulonDB, the literature, and the E. coli K-12 genome. Our model combines a pre-trained DNABERT-2 Transformer for global sequence encoding with a custom CAMP module (CNN-Attention-Mean Pooling) for local feature refinement. We evaluated performance using 5-fold cross-validation and an independent external test set, reporting standard metrics including accuracy, ROC AUC, and Matthews correlation coefficient (MCC).

Results: DNABERT2-CAMP achieved 93.10% accuracy and 97.28% ROC AUC in cross-validation, outperforming existing methods including DNABERT. On the independent test set, it generalized well (89.83% accuracy, 92.79% ROC AUC). Interpretability analyses showed biologically plausible attention over canonical promoter regions, and the CNN filters identified AT-rich and −35-like motifs.

Conclusions: DNABERT2-CAMP demonstrates that combining pre-trained Transformers with convolutional motif detection significantly improves promoter recognition accuracy and interpretability. This framework offers a generalizable tool for genomic annotation and synthetic biology applications.
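For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how accuracy, MCC, and ROC AUC can be computed for a binary promoter/non-promoter classifier. It is not the authors' evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and ROC AUC is computed here by direct pairwise comparison, which is only practical for small inputs.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, 1 = promoter, 0 = non-promoter."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    """Fraction of sequences classified correctly."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical data, not from the paper).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"MCC      = {mcc(y_true, y_pred):.4f}")
print(f"ROC AUC  = {roc_auc(y_true, scores):.4f}")
```

Accuracy and MCC depend on the chosen decision threshold, whereas ROC AUC is computed from the raw scores and is threshold-independent, which is one reason the two kinds of metric are usually reported together.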
