Harnessing machine learning models for epigenome to transcriptome association studies

Abstract

Understanding how epigenome variation contributes to gene expression in disease and development is a fundamental challenge. Regulatory regions show cell type-specific epigenome activity and differ in their location, size, and distance to their target genes, which complicates their discovery and analysis. Machine learning models have recently been proposed to address these problems by learning functions that predict gene expression from epigenomic data. Here, we use the large IHEC EpiATLAS dataset to benchmark state-of-the-art linear and non-linear approaches. Each approach is optimized for over 28,000 human genes, providing a comprehensive regulatory catalog of gene models. In-depth comparison reveals that gene characteristics and the epigenomic complexity of the locus influence how difficult the epigenome-to-transcriptome association is to predict. Model performance is further evaluated using CRISPRi and eQTL validation data. Based on these models, we systematically conduct histone-acetylation association studies to investigate how epigenomic variation impacts gene expression. Applied to patient data, the model-based analysis revealed genes and regulatory regions that are linked to B-cell leukemia and have known disease-related functions. Our work provides a foundation for applications that link epigenome variation to gene expression in human cells by benchmarking methods on a per-gene basis, illustrating their use in a disease context, and making trained models available to the community.
