snATAC-Express infers Gene Expression from Prioritized Chromatin Accessibility Peaks using Machine Learning

Margaret Brown
Alessandro Ferrari
Anne Dodd
Fang Shi
Vasantha L. Kolachala
Subra Kugathasan
Russell D. Wolfinger
Greg Gibson


Abstract

Background

Single-cell multi-omic investigation opens up new opportunities to understand mechanisms of gene regulation. Existing methods for inferring transcript abundance from chromatin accessibility fail to prioritize the most relevant peaks and tend to assume positive associations between ATAC peaks and RNA counts. We hypothesize that gene regulation can be modeled as a function of combined positive and negative interactions among peaks and that causal regulatory variants are enriched in the vicinity of the most critical peaks.

Results

A machine learning pipeline leveraging single-nucleus multiomic transcriptome and chromatin accessibility data is developed to model gene expression as a function of ATAC peak intensity. Multiome data were available for 18 immune cell types from 29 donors, 19 with Crohn’s disease. The pipeline aggregates results from three machine learning approaches (random forest regression, XGBoost, and LightGBM) as well as linear regression to identify which ATAC peaks contribute to explaining variation among donors and cell types in pseudobulk gene expression. The coefficient of determination with cross-validation was used to identify robust models, which typically explain between 5% and 40% of the variance in transcript abundance, utilizing on average 47% of the ATAC peaks, representing a significant gain in predictive accuracy. The most important peaks are enriched in GWAS variants for inflammatory bowel disease and the autoimmune disease systemic lupus erythematosus, but not for rheumatoid arthritis.
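
The cross-validated coefficient of determination mentioned above can be illustrated with a minimal sketch. This is not the snATAC-Express implementation: it covers only a linear regression step, uses NumPy alone, and runs on synthetic data. The function name `cross_validated_r2`, the fold count, and the simulated peak effects (one activating peak, one repressing peak, four uninformative peaks) are all assumptions made for illustration.

```python
import numpy as np


def cross_validated_r2(X, y, n_folds=5, seed=0):
    """Out-of-fold R^2 for an ordinary least-squares model (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    folds = np.array_split(order, n_folds)
    predictions = np.empty(len(y), dtype=float)
    for held_out in folds:
        train = np.setdiff1d(order, held_out)
        # Prepend an intercept column, then fit by least squares on the training rows.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Predict only the held-out rows, so each prediction is out of sample.
        A_test = np.column_stack([np.ones(len(held_out)), X[held_out]])
        predictions[held_out] = A_test @ coef
    ss_res = np.sum((y - predictions) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for pseudobulk data: rows are donor-by-cell-type samples,
# columns are ATAC peak intensities near one gene (assumed sizes).
rng = np.random.default_rng(1)
n_samples, n_peaks = 200, 6
peaks = rng.normal(size=(n_samples, n_peaks))
# Peak 0 is simulated as activating, peak 1 as repressing; peaks 2-5 carry no signal.
expression = 1.5 * peaks[:, 0] - 1.0 * peaks[:, 1] + rng.normal(size=n_samples)

r2_all = cross_validated_r2(peaks, expression)
r2_noise = cross_validated_r2(peaks[:, 2:], expression)
print(f"CV R^2 with all peaks: {r2_all:.2f}")
print(f"CV R^2 with uninformative peaks only: {r2_noise:.2f}")
```

Scoring predictions only on held-out samples is what makes the metric a guard against overfitting: a model built from uninformative peaks scores near or below zero, whereas an in-sample R² would never decrease as peaks are added.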

Conclusion

Atlanta Plots visualize the proportion of ATAC peaks contributing to a predictive model of gene expression as well as the proportion of variance explained by the model. Software implementing our pipeline, “snATAC-Express”, is freely available on GitHub.

Version published to 10.1101/2025.07.25.666784 on bioRxiv
Jul 25, 2025

Related articles

Integrative Bioinformatics Analysis Unveils Neuro-cancer Crosstalk-related Genes and Establishes Prognostic Risk Model in Glioblastoma

This article has 7 authors:
1. Lin Zeng
2. Dingjun Li
3. Mengyu Du
4. Tao Wu
5. Yun Liao
6. Yuxing Huang
7. Xingyu Liao
This article has no evaluations
Latest version Jan 12, 2026

Integrative Transcriptomics and Machine Learning Identify Key Predictive Genes and Pathways in Celiac Disease

This article has 2 authors:
1. Amir Mahdi Taghizadeh
2. Yasin Soflaei
This article has no evaluations
Latest version Jan 7, 2026

Super-Enhancer Network in Metastatic Prostate Cancer: A Bioinformatics and Experimental Approach

This article has 3 authors:
1. Maria Mahmoudi
2. Mehdi Moghanibashi
3. Mostafa Ghaderi-Zefrehei
This article has no evaluations
Latest version Dec 18, 2025
