ConsultChain: Progressive Context Distillation Across Heterogeneous LLM Fleets for Token-Optimal Inference
Abstract
Background: The operational cost of large language model (LLM) inference is dominated by token consumption. In heterogeneous model fleets where prices span two orders of magnitude, most systems still route all traffic to a single model or implement only binary cheap/expensive routing. This leaves significant cost optimization on the table, particularly for agentic workflows that re-process identical context across sessions.

Methods: I designed and implemented ConsultChain, an architecture introducing three novel mechanisms: (1) a five-tier model cascade with progressive context distillation, in which each tier compresses context before escalation rather than forwarding raw input; (2) knowledge crystallization, a self-healing persistent knowledge store modeled on physical crystal formation, with nucleation, growth, and fracture phases; and (3) synaptic pruning, an activity-driven memory management system inspired by adolescent neural development that promotes frequently used pathways and eliminates idle ones. The system was deployed on a Raspberry Pi 5 orchestrating a 10-model fleet accessed via cloud APIs, and was evaluated on simulated workloads of 50 requests per day.

Results: At steady state, the system achieves a 98.5% token cost reduction ($1.74/month versus $112.50/month for a single-model, full-context baseline) on a fleet spanning a 227× price differential ($0.11/M to $25/M tokens). Costs compound downward over time: Tier 0 resolution rates rise from 40% at week 1 to 95% at week 24 as the knowledge lattice matures. The orchestration layer runs in under 1.2 GB of RAM and requires no local GPU.

Conclusions: Progressive context distillation, combined with self-healing knowledge storage and activity-driven memory pruning, enables cost reductions that compound over time rather than remaining static. The approach is feasible on edge hardware and generalizes to any heterogeneous model fleet. The implementation is released as open source.
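The cascade-with-distillation idea from the Methods section can be illustrated with a minimal sketch. This is not the ConsultChain implementation: the tier names, prices, handler/distiller callables, and the word-count token estimate are all illustrative assumptions. Each tier either resolves the request or compresses the context before escalating to the next, more expensive tier, so stronger models see fewer tokens.

```python
# Hypothetical sketch of a model cascade with progressive context distillation.
# Tier names, prices, and the whitespace token estimate are illustrative only.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tier:
    name: str
    price_per_mtok: float                    # USD per million tokens
    handle: Callable[[str], Optional[str]]   # returns an answer, or None on failure
    distill: Callable[[str], str]            # compresses context before escalation

def cascade(context: str, tiers: list[Tier]) -> tuple[str, float]:
    """Try tiers cheapest-first; on failure, distill context and escalate."""
    cost = 0.0
    for tier in tiers:
        tokens = len(context.split())        # crude stand-in for a real tokenizer
        cost += tokens / 1e6 * tier.price_per_mtok
        answer = tier.handle(context)
        if answer is not None:
            return answer, cost
        context = tier.distill(context)      # shrink context before the next tier
    raise RuntimeError("no tier resolved the request")

# Toy two-tier fleet: the cheap tier always fails but halves the context,
# so the strong tier is billed for the distilled input only.
cheap = Tier("cheap", 0.11, lambda c: None, lambda c: " ".join(c.split()[:4]))
strong = Tier("strong", 25.0, lambda c: "answer", lambda c: c)
answer, cost = cascade("a b c d e f g h", [cheap, strong])
```

In this toy run, escalating the distilled 4-token context costs less than sending the full 8-token context straight to the strong model, which is the mechanism the abstract credits for the compounding savings.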