A Pāṇinian Grammar of the Human Genome: The Genomic Periodicity Index Encodes Functional Architecture and Evolutionary Innovation
Abstract
We describe a three-level formal grammar of the human genome that emerges directly from DNA sequence without prior biological annotation. At the first level, the genome maintains a universal structural periodicity of approximately 400 base pairs, an alternation of repeat and unique sequence that we term the Genomic Periodicity Index (GPI). Local deviations from this periodicity, the GPI Deviation (GPID), identify positions where functional requirements override structural packaging: 81.4% of GPID breaks across all 23 human chromosomes overlap known functional elements (p = 3×10⁻¹²⁷ versus a 30% genome-wide baseline), a principle conserved across human, chimpanzee, and gorilla (89%, 75%, and 68%, respectively).
At the second level, dinucleotide composition resolves into five universal sequence classes across all chromosomes tested (mean similarity 0.96, p = 4.65×10⁻⁸), each mapping to a distinct biological function without annotation: the CpG-rich, promoter-associated class, enriched 3.72-fold at GPID breaks, marks sites of transcription initiation; the LINE-rich, AT-rich class marks structurally rigid scaffold sequence. At the third level, transition rules between sequence classes define forbidden adjacencies and reveal systematic compositional changes at transcription start sites, transcription end sites, and splice junctions.
Positions combining the structurally rigid sequence class with GPID breaks are enriched for pathogenic variants (OR = 2.64, p = 10⁻³²), in line with a model in which local sequence entropy predicts variant intolerance.
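The odds ratio reported above for pathogenic-variant enrichment is a 2×2 contingency statistic. The abstract does not give the underlying counts or say how the p-value was computed, so the following is only a minimal sketch of the standard calculation, with a Wald-type 95% confidence interval. The function name `odds_ratio` and all four counts are hypothetical and chosen purely for illustration; they are not taken from the preprint.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio and 95% Wald confidence interval for a 2x2 table.

    a: pathogenic variants inside the candidate positions
    b: non-pathogenic variants inside the candidate positions
    c: pathogenic variants elsewhere
    d: non-pathogenic variants elsewhere
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive")
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) (Woolf's method).
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - 1.96 * se_log)
    high = math.exp(math.log(estimate) + 1.96 * se_log)
    return estimate, low, high


# Hypothetical counts, for illustration only (not data from the preprint).
or_value, ci_low, ci_high = odds_ratio(120, 880, 50, 950)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An odds ratio above 1 with a confidence interval that excludes 1 indicates enrichment; with real data, an exact test such as Fisher's would typically supply the p-value.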
A genome-wide grammar model trained on 1,116,212 windows across all 23 chromosomes independently recovers the Tridosha biochemical hierarchy from dinucleotide composition alone. The Pitta class (CpG-rich, GC content 0.307) concentrates on gene-dense chromosomes (chr19, chr22) and is associated with metabolic and cardiovascular disease; the Kapha class (AT-rich, repeat-dense) is associated with structural tumour suppressor and DNA repair disorders; and the Vāta class is associated with neural and movement disorders. This correspondence with Āyurvedic clinical taxonomy is derived without prior annotation.
Across primate evolution, the GPI at three loci of human-specific disease burden (APOE, HBB, and CFTR) is 10–20-fold longer in human than in other primates, reflecting lineage-specific reorganization of regulatory architecture; nine sites of human-specific regulatory deletion coincide with loss of GPID breaks, marking sequence grammar changes associated with human trait divergence. Derived from raw DNA sequence alone, without biochemical assay, evolutionary alignment, or prior annotation, the GPI framework reveals the human genome as a formal positional grammar encoding regulatory identity, evolutionary constraint, and human-specific biology.
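To make the notion of per-window dinucleotide composition concrete, here is a minimal sketch of how a sequence might be cut into windows and each window summarised as a 16-element frequency vector. The abstract does not state the window size, step, or similarity measure, so the values used here (8 bp windows on a toy sequence, cosine similarity) are assumptions for illustration, and the helper names `dinucleotide_freqs`, `windows`, and `cosine` are hypothetical.

```python
import math
from itertools import product

# The 16 dinucleotides in a fixed order: AA, AC, AG, AT, CA, ...
DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]


def dinucleotide_freqs(seq):
    """Return the 16 overlapping-dinucleotide frequencies of seq."""
    seq = seq.upper()
    counts = dict.fromkeys(DINUCS, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skips pairs containing N or other symbols
            counts[pair] += 1
    total = sum(counts.values())
    return [counts[d] / total if total else 0.0 for d in DINUCS]


def windows(seq, size, step):
    """Yield (start, subsequence) for fixed-size windows along seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy sequence: a CpG-rich half followed by an AT-rich half.
toy = "CGCGCGCGATATATAT"
vectors = [dinucleotide_freqs(w) for _, w in windows(toy, size=8, step=8)]
print(cosine(vectors[0], vectors[1]))
```

Vectors like these could then be passed to any clustering method to look for recurring composition classes; which method the authors used is not specified in the abstract.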