Atacformer: A transformer-based foundation model for analysis and interpretation of ATAC-seq data

Abstract

Chromatin accessibility profiling is an important tool for understanding gene regulation and cellular function. Although public repositories house nearly 10,000 scATAC-seq experiments, unifying these data for meaningful analysis remains challenging. Existing tools struggle with the scale and complexity of scATAC-seq datasets, which limits tasks such as clustering, cell-type annotation, and reference mapping. A promising solution is to use foundation models that are adapted to specific tasks via transfer learning. Transfer learning has been applied to scRNA-seq, but its potential for scATAC-seq remains underexplored.

Methods

We introduce Atacformer, a transformer-based foundation model for scATAC-seq data analysis. Unlike other models that produce only cell-level representations, Atacformer generates embeddings for individual cis-regulatory elements. Pretrained on a large atlas of scATAC-seq experiments, Atacformer learns robust representations of genomic regulatory regions for downstream use. After pretraining, the model is fine-tuned for cell-type prediction and batch correction. We also integrated Atacformer with RNA-seq data to build a Contrastive RNA-ATAC Fine Tuning (CRAFT) model capable of cross-modal alignment and RNA imputation from ATAC data.
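To make the idea of contrastive cross-modal alignment concrete, the following is a minimal sketch of a symmetric, CLIP-style InfoNCE loss over paired ATAC and RNA cell embeddings. The abstract does not specify the CRAFT objective, so the loss form, the temperature value, and the function names here are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def symmetric_contrastive_loss(atac_emb, rna_emb, temperature=0.07):
    """Hypothetical CLIP-style loss; row i of each input is the same cell.

    Assumption: matched ATAC/RNA pairs are positives and every other
    pairing in the batch is a negative. The temperature is illustrative.
    """
    if not atac_emb or len(atac_emb) != len(rna_emb):
        raise ValueError("inputs must be non-empty and of equal length")
    a = [l2_normalize(v) for v in atac_emb]
    r = [l2_normalize(v) for v in rna_emb]
    n = len(a)
    # Cosine similarity between every ATAC cell and every RNA cell.
    logits = [
        [sum(x * y for x, y in zip(a[i], r[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Softmax cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(v - m) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    # Average the ATAC-to-RNA and RNA-to-ATAC directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(columns))
```

Under these assumptions, the loss is near zero when each cell's ATAC embedding is closest to its own RNA embedding, and it grows when the pairing is scrambled.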

Results

Atacformer matches or exceeds leading scATAC-seq clustering tools in adjusted Rand index and runtime, with fine-tuned models achieving top performance across datasets. It processes raw fragment files end-to-end 80% faster than existing tools while preserving biological structure. Fine-tuned on bulk BED files, it recovers cell-type and assay labels with over 80% accuracy. We show how the Atacformer architecture produces contextualized embeddings of individual genomic regions, which we use to identify unannotated, cell-type-specific promoter elements directly from chromatin accessibility data.
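For readers unfamiliar with the clustering metric named above, here is a minimal standard-library sketch of the adjusted Rand index, which scores agreement between predicted clusters and reference cell-type labels after correcting for chance. The function name and the handling of degenerate inputs are choices made for this illustration; the abstract does not say how the authors computed the metric.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two flat labelings of the same cells."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    if n < 2:
        # No pairs of cells to compare; treated here as perfect agreement.
        return 1.0

    # Contingency counts: cells sharing a reference label and a predicted cluster.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        # Degenerate case (for example, both labelings put all cells together).
        return 1.0
    return (sum_joint - expected) / (maximum - expected)
```

A score of 1 means the two labelings are identical up to renaming of clusters, a score near 0 means agreement no better than chance, and negative scores indicate agreement worse than chance.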
