Genome-wide Statistical and Machine Learning Analysis of Long-Range Sequence Memory Across Human Chromosomes 1, 2, and 22
Abstract
Background: The human genome displays heterogeneous patterns of sequence organization, including long-range correlations, GC-content fluctuations, and repetitive structures that influence chromatin state and regulatory activity. While many studies focus on motifs and short-range sequence features, genome-wide analyses of long-range memory, quantified through Hurst exponents, autocorrelation, entropy, and power-law scaling, remain limited. Determining whether these memory signatures differ between regulatory and non-regulatory regions can provide new biological insight and support improved computational annotation.

Results: We analyzed human chromosomes 1, 2, and 22 using sliding 50-kb windows and computed a unified set of sequence-statistical features, including DFA-based Hurst exponents, power-law decay ($\alpha$), k-mer entropy, GC-content, multi-scale autocorrelation, and homopolymer run statistics. Regulatory windows consistently showed higher long-range memory, lower entropy, stronger short-range autocorrelation, and enriched AT/GC homopolymer runs compared with non-regulatory windows ($p < 0.001$). A Balanced Random Forest classifier trained on these features achieved strong predictive performance across all chromosomes (ROC–AUC 0.93–0.96). Feature-importance analysis indicated that homopolymer run lengths, k-mer entropy, autocorrelation, and Hurst exponents were the dominant predictors. Chromosome 22, the gene-sparse “dark genome,” exhibited the same memory signatures, demonstrating robustness in structurally diverse regions.

Conclusions: Long-range sequence memory is a consistent and quantifiable property of the human genome. Regulatory regions display stronger memory signatures than background sequence, reflecting multi-kilobase compositional coherence that may support regulatory function. These results highlight the utility of memory-based descriptors for genomic annotation and provide a reproducible framework for studying genome structure, evolution, and regulatory architecture.
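To make the per-window features named in the abstract concrete, the following is a minimal Python sketch of four of them: GC-content, k-mer entropy, longest homopolymer run, and a first-order DFA scaling exponent. It is an illustration under stated assumptions, not the authors' pipeline. The function names, the choice of k = 3, the DFA scales, and the +1/-1 GC encoding of the sequence are all hypothetical; the abstract does not specify any of them.

```python
import math
from collections import Counter
from itertools import groupby

import numpy as np


def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous (A/C/G/T) bases."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_entropy(seq: str, k: int = 3) -> float:
    """Shannon entropy in bits of the overlapping k-mer distribution."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def longest_homopolymer(seq: str) -> int:
    """Length of the longest run of a single repeated base."""
    return max((len(list(run)) for _, run in groupby(seq.upper())), default=0)


def gc_signal(seq: str) -> np.ndarray:
    """Assumed numeric encoding: +1 for G/C, -1 for any other symbol."""
    return np.array([1.0 if b in "GC" else -1.0 for b in seq.upper()])


def dfa_exponent(x, scales=(16, 32, 64, 128, 256)) -> float:
    """First-order DFA scaling exponent of a numeric series.

    Assumes the series is longer than the largest scale and is not constant.
    Values near 0.5 indicate no long-range memory; larger values indicate
    persistent correlations.
    """
    x = np.asarray(x, dtype=float)
    profile = np.cumsum(x - x.mean())  # integrated, mean-centred series
    fluctuations = []
    for n in scales:
        n_seg = len(profile) // n
        segs = profile[: n_seg * n].reshape(n_seg, n).T  # shape (n, n_seg)
        t = np.arange(n)
        coeffs = np.polyfit(t, segs, 1)  # linear fit per segment
        trend = np.outer(t, coeffs[0]) + coeffs[1]
        fluctuations.append(np.sqrt(np.mean((segs - trend) ** 2)))
    slope, _ = np.polyfit(np.log(scales), np.log(fluctuations), 1)
    return float(slope)
```

For a single 50-kb window held in a string `window`, the features would be obtained as `gc_content(window)`, `kmer_entropy(window)`, `longest_homopolymer(window)`, and `dfa_exponent(gc_signal(window))`. Sliding the window along a chromosome and stacking the results gives a feature table of the kind that could be passed to a classifier.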