snATAC-Express infers Gene Expression from Prioritized Chromatin Accessibility Peaks using Machine Learning
Abstract
Background
Single-cell multi-omic investigation opens up new opportunities to understand mechanisms of gene regulation. Existing methods for inferring transcript abundance from chromatin accessibility fail to prioritize the most relevant peaks and tend to assume positive associations between ATAC peaks and RNA counts. We hypothesize that gene regulation can be modeled as a function of combined positive and negative interactions among peaks and that causal regulatory variants are enriched in the vicinity of the most critical peaks.
Results
A machine learning pipeline leveraging single-nucleus multiomic transcriptome and chromatin accessibility data is developed to model gene expression as a function of ATAC peak intensity. Multiome data was available for 18 immune cell types from 29 donors, 19 of whom have Crohn’s disease. The pipeline aggregates results from three machine learning approaches (random forest regression, XGBoost, and LightGBM) as well as linear regression to identify which ATAC peaks contribute to explaining variation among donors and cell types in pseudobulk gene expression. The coefficient of determination with cross-validation was used to identify robust models, which typically explain between 5% and 40% of the variance in transcript abundance, utilizing on average 47% of the ATAC peaks, representing a significant gain in predictive accuracy. The most important peaks are enriched in GWAS variants for inflammatory bowel disease and the autoimmune disease systemic lupus erythematosus, but not for rheumatoid arthritis.
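To make the cross-validated coefficient of determination concrete, the following is a minimal sketch and not the snATAC-Express implementation. It covers only the linear regression component (the tree-based learners are omitted), uses synthetic data in place of pseudobulk measurements, and pools out-of-fold predictions into a single R², which is one of several possible conventions. All function names, the fold count, and the simulated effect sizes are assumptions made for illustration.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    Z = [[1.0] + list(row) for row in X]
    p = len(Z[0])
    xtx = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    xty = [sum(z[i] * yi for z, yi in zip(Z, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(coef, X):
    """Apply fitted coefficients (intercept first) to each row of X."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in X]


def cross_validated_r2(X, y, k=5, seed=0):
    """R^2 computed from pooled out-of-fold predictions over k folds."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    preds = [0.0] * len(y)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        coef = fit_ols([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, predict(coef, [X[i] for i in fold])):
            preds[i] = p
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for pseudobulk data (hypothetical values):
# peak 0 acts positively, peak 1 negatively, peak 2 has no effect.
rng = random.Random(1)
n_samples = 120
peaks = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n_samples)]
expression = [2.0 * p[0] - 1.5 * p[1] + rng.gauss(0, 1) for p in peaks]

r2_all = cross_validated_r2(peaks, expression)
r2_null = cross_validated_r2([[p[2]] for p in peaks], expression)
print(f"all peaks: {r2_all:.2f}, uninformative peak only: {r2_null:.2f}")
```

Because the score is computed on held-out samples, a model built only on the uninformative peak scores near zero (it can even be slightly negative), whereas the model that includes both the positive and the negative peak retains most of its explanatory power. This is the property that makes a cross-validated R² usable as a filter for robust models.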
Conclusion
Atlanta Plots visualize the proportion of ATAC peaks contributing to a predictive model of gene expression as well as the proportion of variance explained by the model. Software implementing our pipeline, “snATAC-Express”, is freely available on GitHub.