Genome-wide Statistical and Machine Learning Analysis of Long-Range Sequence Memory Across Human Chromosomes 1, 2, and 22

Abstract

Background: The human genome displays heterogeneous patterns of sequence organization, including long-range correlations, GC-content fluctuations, and repetitive structures that influence chromatin state and regulatory activity. While many studies focus on motifs and short-range sequence features, genome-wide analyses of long-range memory, quantified through Hurst exponents, autocorrelation, entropy, and power-law scaling, remain limited. Determining whether these memory signatures differ between regulatory and non-regulatory regions can provide new biological insight and support improved computational annotation.

Results: We analyzed human chromosomes 1, 2, and 22 using sliding 50-kb windows and computed a unified set of sequence-statistical features: Hurst exponents estimated by detrended fluctuation analysis (DFA), power-law decay exponents ($\alpha$), k-mer entropy, GC content, multi-scale autocorrelation, and homopolymer run statistics. Compared with non-regulatory windows, regulatory windows consistently showed higher long-range memory, lower entropy, stronger short-range autocorrelation, and an enrichment of AT/GC homopolymer runs ($p < 0.001$). A Balanced Random Forest classifier trained on these features achieved strong predictive performance on all three chromosomes (ROC–AUC 0.93–0.96). Feature-importance analysis indicated that homopolymer run lengths, k-mer entropy, autocorrelation, and Hurst exponents were the dominant predictors. Chromosome 22, described here as the gene-sparse “dark genome,” exhibited the same memory signatures, indicating that they are robust across structurally diverse regions.

Conclusions: Long-range sequence memory is a consistent and quantifiable property of the human genome. Regulatory regions display stronger memory signatures than background sequence, reflecting multi-kilobase compositional coherence that may support regulatory function. These results highlight the utility of memory-based descriptors for genomic annotation and provide a reproducible framework for studying genome structure, evolution, and regulatory architecture.
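
To make the window-level features named in the Results more concrete, the following is a minimal Python sketch of how GC content, k-mer entropy, the longest homopolymer run, and a first-order DFA scaling exponent could be computed per window. It is an illustration under stated assumptions, not the authors' pipeline: the function names, the GC/AT (+1/-1) numeric encoding, the DFA scales, the choice of k = 3, and the non-overlapping step are all hypothetical choices that the abstract does not specify. For stationary, noise-like series the DFA exponent is commonly read as an estimate of the Hurst exponent, with values near 0.5 indicating no long-range memory.

```python
import math
from collections import Counter
from itertools import groupby


def gc_content(seq):
    """Fraction of G/C among A/C/G/T bases (other symbols such as N are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(b) for b in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_entropy(seq, k=3):
    """Shannon entropy, in bits, of the overlapping k-mer frequency distribution."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def longest_homopolymer_run(seq):
    """Length of the longest run of one repeated base."""
    return max((len(list(g)) for _, g in groupby(seq.upper())), default=0)


def _line_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def dfa_exponent(series, scales=(16, 32, 64, 128, 256)):
    """First-order DFA: slope of log F(n) against log n.

    The default scales are an assumption made for this sketch. At least two
    scales must fit inside the series, otherwise the final fit is undefined.
    """
    mean = sum(series) / len(series)
    profile, running = [], 0.0
    for v in series:  # integrated, mean-centred profile
        running += v - mean
        profile.append(running)
    log_n, log_f = [], []
    for n in scales:
        t = list(range(n))
        sq_resid, n_pts = 0.0, 0
        # Non-overlapping segments of length n, each detrended by a local line.
        for start in range(0, len(profile) - n + 1, n):
            seg = profile[start:start + n]
            slope, icpt = _line_fit(t, seg)
            sq_resid += sum((y - (slope * x + icpt)) ** 2 for x, y in zip(t, seg))
            n_pts += n
        if n_pts and sq_resid > 0:
            log_n.append(math.log(n))
            log_f.append(0.5 * math.log(sq_resid / n_pts))  # log of RMS fluctuation
    return _line_fit(log_n, log_f)[0]


def window_features(seq, size=50_000, step=50_000):
    """Yield one feature dict per window; step is an assumed, hypothetical value."""
    for start in range(0, len(seq) - size + 1, step):
        w = seq[start:start + size].upper()
        # Assumed encoding: +1 for G/C, -1 for anything else.
        walk = [1.0 if b in "GC" else -1.0 for b in w]
        yield {
            "start": start,
            "gc": gc_content(w),
            "entropy": kmer_entropy(w, 3),
            "max_run": longest_homopolymer_run(w),
            "dfa_alpha": dfa_exponent(walk),
        }
```

Calling `list(window_features(chromosome_sequence))` on a sequence string would give one feature record per 50-kb window, which could then be labeled as regulatory or non-regulatory and passed to a classifier. The autocorrelation and power-law decay features mentioned in the Results are omitted here to keep the sketch short.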
