CpG-induced regions associated compositional stratification of human essential proteins through mathematical genomics
Abstract
CpG islands are genomic regions enriched in cytosine–phosphate–guanine dinucleotides, typically associated with gene promoters and regulatory elements. While their role in transcriptional regulation is well established, their influence on protein sequence composition remains underexplored. In this study, 3,222 essential proteins from Homo sapiens were analyzed to investigate the impact of CpG island architecture on amino acid usage, sequence complexity, and chromosomal distribution. Codons containing both cytosine and guanine were mapped to five CpG-induced amino acids, and their relative abundance was quantified across proteins. CpG-induced regions were identified using a sliding window approach, and metrics such as average CpG density, longest CpG-induced region length, and sequence coverage were computed. Motif composition within these regions was characterized using polarity and charge-based indices. Clustering revealed consistent bipartite subgroup structures, indicating compositional stratification among essential proteins. Shannon entropy quantified compositional complexity, revealing a significant shift between clusters (p = 5.50 × 10⁻¹⁴⁵). Cluster 1 proteins showed lower entropy and greater heterogeneity, whereas Cluster 2 showed higher entropy and tighter distributions, indicating distinct regimes of variability. Additionally, we examined CpG island overlap within coding exons of essential genes and found that CpG-depleted proteins are predominantly encoded by genes with minimal CGI coverage, whereas CpG-induced proteins span both low- and high-overlap classes. CGI-associated CpGs were consistently enriched toward the 5′ end of essential genes, indicating a positional bias in CpG distribution. These findings establish a quantitative framework linking CpG island distribution to proteomic architecture and offer a scalable strategy for motif annotation, epigenetic modeling, and functional stratification.
Future applications include predictive modeling of protein function, disease association, and regulatory dynamics.
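
The sliding-window region metrics and the Shannon entropy measure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the five CpG-induced amino acids, so the set used here (Ala, Pro, Arg, Ser, Thr, the residues with at least one codon containing the CG dinucleotide in the standard genetic code) is an assumption. The window size, the density threshold, and the function names are likewise hypothetical choices made only for illustration.

```python
import math
from collections import Counter

# ASSUMPTION: residues with at least one CG-containing codon in the standard
# genetic code (Ala, Pro, Arg, Ser, Thr). The abstract does not list the set.
CPG_RESIDUES = frozenset("APRST")


def shannon_entropy(seq: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a protein."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def cpg_regions(seq: str, window: int = 10, threshold: float = 0.5):
    """Merge qualifying windows into (start, end) spans, end exclusive.

    A window qualifies when its fraction of CPG_RESIDUES is >= threshold.
    Both `window` and `threshold` are hypothetical parameter values.
    """
    regions = []
    for i in range(len(seq) - window + 1):
        frac = sum(r in CPG_RESIDUES for r in seq[i:i + window]) / window
        if frac >= threshold:
            if regions and i <= regions[-1][1]:
                # Overlapping or adjacent window: extend the previous span.
                regions[-1] = (regions[-1][0], i + window)
            else:
                regions.append((i, i + window))
    return regions


def region_metrics(seq: str, window: int = 10, threshold: float = 0.5):
    """Per-protein summary: overall density, longest region, coverage."""
    regions = cpg_regions(seq, window, threshold)
    covered = sum(end - start for start, end in regions)
    n = len(seq)
    return {
        "cpg_fraction": sum(r in CPG_RESIDUES for r in seq) / n if n else 0.0,
        "longest_region": max((e - s for s, e in regions), default=0),
        "coverage": covered / n if n else 0.0,
        "entropy": shannon_entropy(seq),
    }
```

Calling `region_metrics` on each protein sequence would yield a feature vector of the kind that could then be passed to a clustering step. The clustering method itself is not specified in the abstract and is not sketched here.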