Automated Semantic State-Layout Synthesis for Generated Lexers:Structural Evaluation with Rollback-Aware Semantics

Reda Belaiche

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

Generated lexical analyzers are shaped not only by the DFA transition graph they implement, but also by ordinary code-generation choices such as how states are numbered and how that numbering is interpreted at run time. In practice, a state identifier often carries more meaning than simple acceptance: it may encode token family, priority, auxiliary status, and, under maximal-munch scanning, whether the state matters for rollback or last-accept recovery. Many generators recover that meaning from incidental numeric identifiers through lookup tables or ad hoc conditionals. This paper asks whether that semantic layer can itself be synthesized. More precisely, we study how to construct a semantics-preserving numeric layout for DFA states together with a comparison-based classifier over state identifiers, while leaving the recognized language, transition graph, and longest-match behavior unchanged. After formalizing the problem, we prove basic structural properties and present constructive procedures for contiguous class layouts, profile-guided interval ordering, and hierarchical comparison-tree layouts. We then evaluate these layouts structurally on ten lexer-oriented benchmarks ranging from 13 to 1368 states, under semantic labelings that range from binary acceptance to fine token-family distinctions with rollback-aware variants. Across 12,600 completed generated configurations, a consistent picture emerges. Flat multi-class interval layouts are always realizable, but their worst-case comparison depth grows with the number of semantic classes. Profile-aware ordering leaves interval complexity unchanged while reducing expected depth substantially when the profile matches deployment behavior, especially for finer labelings. Hierarchical comparison trees sharply reduce worst-case depth, particularly on the larger synthetic token-trie benchmarks, though they also generate larger classifier code. Rollback-aware semantic splitting improves rollback-oriented predicates at a modest semantic cost in the coarse and operational settings.

Version published to 10.21203/rs.3.rs-9238282/v1 on Research Square
Mar 31, 2026

Grammar-Guided Incremental Method for Efficient LLM-Generated Code Execution

This article has 2 authors:
1. Anton Svystunov
2. Yaroslav Tereshchenko
This article has no evaluationsLatest version Apr 2, 2026
Exact Pattern-Aware Extraction for Equality Saturation via Bounded-Depth Tree Covering

This article has 3 authors:
1. Zi Cheng
2. Mengting Yuan
3. Lefei Zhang
This article has no evaluationsLatest version Apr 9, 2026
A Compositional Exact-Search Calculus over a Single Binary Primitive: Bounded Enumeration, Macro Closure, and Stability Audits for EML Trees

This article has 1 author:
1. Anoop Madhusudanan
This article has no evaluationsLatest version Apr 16, 2026

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

Grammar-Guided Incremental Method for Efficient LLM-Generated Code Execution

Exact Pattern-Aware Extraction for Equality Saturation via Bounded-Depth Tree Covering

A Compositional Exact-Search Calculus over a Single Binary Primitive: Bounded Enumeration, Macro Closure, and Stability Audits for EML Trees