Nucleotide GPT: Sequence-Based Deep Learning Prediction of Nuclear Subcompartment-Associated Genome Architecture
Abstract
The spatial organization of the genome within the nucleus is partially determined by its interactions with distinct nuclear subcompartments, such as the nuclear lamina and nuclear speckles, which play key roles in gene regulation during development. However, whether these genome-nuclear subcompartment interactions are encoded in the underlying DNA sequence remains poorly understood. The mechanisms for gene regulation are primarily encoded in noncoding DNA sequences, but deciphering how these sequence features control gene expression remains a significant challenge in genomics. Here, we present Nucleotide GPT, a transformer-based model that predicts genomic associations with spatially distinct, physical nuclear subcompartments from DNA sequence alone. Nucleotide GPT is pre-trained on a diverse set of multi-species genomes, and we demonstrate its genomic understanding through evaluation on diverse prediction tasks, including histone modifications, promoter detection, and transcription factor binding sites. When fine-tuned to predict genome interactions with two separate nuclear subcompartments (the lamina of the inner nuclear membrane and the more interior nuclear speckles), Nucleotide GPT achieves an accuracy of 73.6% for lamina-associated domains (LADs) and 79.4% for speckle-associated domains (SPADs), each averaged across three cortical development cell types. Analysis of the model’s learned representations through Uniform Manifold Approximation and Projection (UMAP) reveals that Nucleotide GPT develops internal embeddings that effectively distinguish LADs from inter-LADs, with predicted probabilities closely corresponding to experimentally determined LAD classifications.
When we compare cell type-invariant constitutive LADs (cLADs), which are conserved across neuronal differentiation, with cell type-specific LADs, the model assigns lower confidence scores to the cell type-specific LADs, suggesting that sequence features may play a stronger role in maintaining cLAD associations. Examination of the model’s attention patterns at correctly classified regions suggests that specific sequence elements govern the model’s decisions about nuclear subcompartment associations. Our results demonstrate the utility of transformer architectures for studying three-dimensional (3D) genome organization and substantiate a role for DNA sequence in determining nuclear subcompartment associations.
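The comparison of confidence scores between cLADs and cell type-specific LADs could be summarized along the following lines. This is a minimal sketch, not the authors' analysis code: the function name `mean_confidence_by_group`, the group labels, and the probability values are all hypothetical, and the numbers are made-up toy inputs rather than results from the study.

```python
from statistics import mean


def mean_confidence_by_group(probs, groups):
    """Average the predicted LAD probability within each group label.

    probs  : predicted LAD probabilities, one per genomic region
    groups : group label for each region (same length as probs)
    """
    by_group = {}
    for p, g in zip(probs, groups):
        by_group.setdefault(g, []).append(p)
    return {g: mean(values) for g, values in by_group.items()}


# Toy, made-up values for illustration only; NOT data from the paper.
toy_probs = [0.9, 0.8, 0.6, 0.5]
toy_groups = ["cLAD", "cLAD", "cell_type_specific", "cell_type_specific"]

summary = mean_confidence_by_group(toy_probs, toy_groups)
for group, avg in summary.items():
    print(f"{group}: mean predicted LAD probability = {avg:.2f}")
```

With real model outputs, a lower group mean for cell type-specific LADs than for cLADs would correspond to the pattern described above; a statistical test on the two distributions would be needed before drawing any conclusion.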