A genome-wide, machine learning-guided exploration of the cis-regulatory code involved in neuronal differentiation

Abstract

Gene expression is controlled by proximal and distal cis-regulatory elements (CREs), which contain DNA motifs bound by various transcription factors (TFs). Other sequence features, such as specific k-mers or low-complexity regions, have also been implicated. However, for a dynamic biological process such as cell differentiation, we lack an understanding of how the transcriptional activity of CREs progressively changes and what sequence features underlie these transitions, which may reflect common and/or coordinated regulatory processes. Here, we use single-cell ATAC-seq and RNA-seq to follow CREs, at genome scale, along the differentiation of induced pluripotent stem cells into cortical neurons, and we develop a method to automatically identify the diversity of CRE profiles and their underlying sequence features. We propose a machine learning-guided clustering algorithm, STOIC (Statistical learning TO Inform Clustering), that jointly learns an unsupervised clustering of the CREs in the space of the activity profiles and a supervised predictor associated with each cluster in the DNA-sequence space. This procedure explores the expression space and delineates the CRE clusters iteratively to optimize the performance of a supervised classifier that predicts CRE cluster membership from DNA sequence features. STOIC is specifically designed to provide readily interpretable results. We show that the method identifies CRE profiles associated with highly predictive sequence features and outperforms methods solely concerned with co-activity clustering on this task. Orthogonal data collected in the same settings link the inferred CRE clusters to specific enhancer or promoter signatures. Furthermore, we show that the DNA features unveiled by STOIC reflect biologically relevant regulators and offer a valuable basis to dissect elements of the cis-regulatory grammar. Finally, we demonstrate the general applicability of STOIC by analyzing five bulk CAGE datasets of human cells responding to various treatments.
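
The idea of clustering CREs in activity space while keeping each cluster predictable from sequence can be illustrated with a toy sketch. This is not the STOIC algorithm, whose details are not given in the abstract. It is a simplified stand-in that assumes numeric activity profiles and numeric sequence features (for example k-mer counts), uses a nearest-centroid rule as the "classifier", and alternates assignment and centroid updates under a weighted cost. The function names, the `alpha` weight, and the farthest-first initialisation are all hypothetical choices made for illustration.

```python
import numpy as np


def sq_dists(x, centroids):
    """Squared Euclidean distance from every row of x to every centroid."""
    return ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)


def toy_guided_clustering(activity, seq_feats, k=2, alpha=0.5, n_iter=50):
    """Toy alternating scheme (an assumption, not the published method).

    activity  : (n, t) array, one activity profile per CRE
    seq_feats : (n, f) array, one sequence-feature vector per CRE
    alpha     : weight of the sequence term in the assignment cost
    Returns the cluster labels and the accuracy of a nearest-centroid
    sequence classifier at recovering those labels.
    """
    activity = np.asarray(activity, dtype=float)
    seq_feats = np.asarray(seq_feats, dtype=float)

    def cost(act_c, seq_c):
        # Weighted sum of activity-space and sequence-space distances.
        return ((1 - alpha) * sq_dists(activity, act_c)
                + alpha * sq_dists(seq_feats, seq_c))

    # Deterministic farthest-first initialisation, starting from CRE 0.
    seeds = [0]
    while len(seeds) < k:
        nearest = cost(activity[seeds], seq_feats[seeds]).min(axis=1)
        seeds.append(int(np.argmax(nearest)))
    act_c, seq_c = activity[seeds].copy(), seq_feats[seeds].copy()

    labels = np.full(len(activity), -1)
    for _ in range(n_iter):
        # Assignment step: each CRE joins the cluster with the lowest cost.
        new_labels = cost(act_c, seq_c).argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Update step: refit centroids in both spaces (skip empty clusters).
        for c in range(k):
            members = labels == c
            if members.any():
                act_c[c] = activity[members].mean(axis=0)
                seq_c[c] = seq_feats[members].mean(axis=0)

    # How well do sequence features alone recover the cluster labels?
    seq_pred = sq_dists(seq_feats, seq_c).argmin(axis=1)
    return labels, float((seq_pred == labels).mean())
```

With `alpha=0` the sketch reduces to plain centroid clustering on activity profiles alone. Raising `alpha` trades co-activity tightness for clusters that are easier to predict from sequence, which is the tension the abstract describes.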
