Automated Semantic State-Layout Synthesis for Generated Lexers:Structural Evaluation with Rollback-Aware Semantics
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Generated lexical analyzers are shaped not only by the DFA transition graph they implement, but also by ordinary code-generation choices such as how states are numbered and how that numbering is interpreted at run time. In practice, a state identifier often carries more meaning than simple acceptance: it may encode token family, priority, auxiliary status, and, under maximal-munch scanning, whether the state matters for rollback or last-accept recovery. Many generators recover that meaning from incidental numeric identifiers through lookup tables or ad hoc conditionals. This paper asks whether that semantic layer can itself be synthesized. More precisely, we study how to construct a semantics-preserving numeric layout for DFA states together with a comparison-based classifier over state identifiers, while leaving the recognized language, transition graph, and longest-match behavior unchanged. After formalizing the problem, we prove basic structural properties and present constructive procedures for contiguous class layouts, profile-guided interval ordering, and hierarchical comparison-tree layouts. We then evaluate these layouts structurally on ten lexer-oriented benchmarks ranging from 13 to 1368 states, under semantic labelings that range from binary acceptance to fine token-family distinctions with rollback-aware variants. Across 12,600 completed generated configurations, a consistent picture emerges. Flat multi-class interval layouts are always realizable, but their worst-case comparison depth grows with the number of semantic classes. Profile-aware ordering leaves interval complexity unchanged while reducing expected depth substantially when the profile matches deployment behavior, especially for finer labelings. Hierarchical comparison trees sharply reduce worst-case depth, particularly on the larger synthetic token-trie benchmarks, though they also generate larger classifier code. Rollback-aware semantic splitting improves rollback-oriented predicates at a modest semantic cost in the coarse and operational settings.