Organ-System Disease Identity Is Encoded in the Physical Grammar of Regulatory DNA
Abstract
Regulatory DNA is not a passive sequence of binding sites. It is a structured physical medium governed by transformation rules (a grammar, in the formal sense) that encodes the organ-system identity of every gene it controls. Here we formalise this grammar using the framework of Pāṇini (c. 4th century BCE), whose classification of Sanskrit phonemes by their physical articulatory properties and derivation of transformation rules for their junctions together constitute the earliest known generative grammar. Applied to DNA, the same logical structure (physical classification of units, junction transformation rules, ordered derivational path) generates a predictive theory of how regulatory sequence encodes biological specificity and how its disruption produces disease. We compute a high-dimensional grammatical feature space (the 64-Kalā) from the promoter sequences of human genes, capturing G-quadruplex density, CpG architecture, thermodynamic gradient sharpness, palindromic organisation, and transposable element composition. This grammar alone, without protein annotation or chromatin data, classifies genes well above random expectation into three constitutional regulatory principles with distinct physical identities: Kapha, Pitta and Vāta. These are the Doshas, the organ-system archetypes of Ayurvedic medicine. We define a per-position grammatical fragility score that quantifies how sensitively a regulatory sequence responds to disruption at each position. GWAS variants at the most fragile positions show strong constitutional concordance, that is, a match between the regulatory identity of the gene and the organ system of the associated disease. Independent validation with ClinVar regulatory variants confirms the zone architecture of the grammar: the medial promoter zone shows high concordance, while the proximal core promoter shows no signal, consistent with purging by purifying selection. Positions at splicing junctions (Splice Sandhi) are strongly enriched for pathogenic variants.
Across nearly thirty thousand gnomAD variants, fragile positions are significantly depleted for common variants, confirming that evolutionary pressure protects grammar-critical positions across the human population. Grammar errors at fragile positions produce directed shifts in regulatory Dosha identity: disruptions are misdirections toward a predictable alternative organ-system programme, not random failures. Across the full cis-regulatory landscape, intronic enhancer variants at maximal fragility reach near-complete Dosha concordance. Cross-population replication in four independent gnomAD super-populations, including African lineages predating the out-of-Africa dispersal, confirms that this constraint is universal to the human species. Five independent lines of evidence converge on a coherent physical theory of regulatory disease: grammar errors are misdirections, not deletions, and the direction of misdirection is encoded in the physical character of the disrupted junction. The field does not need better pattern recognition. It needs a grammar. No existing variant scoring tool addresses organ-system specificity; this grammar does, from sequence alone.
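To make the idea of sequence-only features and a per-position fragility score more concrete, the following Python sketch computes three simple promoter features (G-quadruplex motif density, the CpG observed/expected ratio, and GC fraction) and then scores each position by how far single-base substitutions move that feature vector. This is an illustration only, not the 64-Kalā feature space or the fragility score described above: the function names, the choice of three features, the G-quadruplex regular expression, and the substitution-scan definition of fragility are all assumptions made for the example.

```python
import math
import re

# Assumption: the widely used textbook G4 motif (four runs of >=3 G separated
# by loops of 1-7 nt), scanned on the given strand only.
G4_MOTIF = re.compile(r"G{3,}(?:[ACGT]{1,7}G{3,}){3}")


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: (#CG * N) / (#C * #G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


def g4_per_kb(seq: str) -> float:
    """Non-overlapping G4 motif matches per 1000 bases."""
    seq = seq.upper()
    if not seq:
        return 0.0
    hits = sum(1 for _ in G4_MOTIF.finditer(seq))
    return 1000.0 * hits / len(seq)


def features(seq: str) -> tuple:
    """Hypothetical three-feature vector: (G4 per kb, CpG o/e, GC fraction)."""
    seq = seq.upper()
    gc = (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0
    return (g4_per_kb(seq), cpg_obs_exp(seq), gc)


def fragility(seq: str) -> list:
    """Hypothetical per-position score: mean Euclidean shift of the feature
    vector over all single-base substitutions at that position."""
    seq = seq.upper()
    ref = features(seq)
    scores = []
    for i, base in enumerate(seq):
        shifts = [
            math.dist(ref, features(seq[:i] + alt + seq[i + 1:]))
            for alt in "ACGT"
            if alt != base
        ]
        scores.append(sum(shifts) / len(shifts))
    return scores


if __name__ == "__main__":
    # Toy sequence: one G4 motif flanked by AT repeats.
    toy = "AT" * 5 + "GGGAGGGAGGGAGGG" + "AT" * 5
    for pos, score in enumerate(fragility(toy)):
        print(pos, toy[pos], round(score, 3))
```

In this toy setting, positions inside the G-runs score much higher than flanking positions, because a single substitution there destroys the motif. The three features are on very different scales, so any real analysis would need to standardise them before taking distances; the sketch leaves them raw to keep the arithmetic transparent.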