ConsultChain: Progressive Context Distillation Across Heterogeneous LLM Fleets for Token-Optimal Inference
Abstract
Background: The operational cost of large language model (LLM) inference is dominated by token consumption. In heterogeneous model fleets where prices span two orders of magnitude, most systems still route all traffic to a single model or implement only binary cheap/expensive routing. This leaves significant cost optimization on the table, particularly for agentic workflows that re-process identical context across sessions.

Methods: I designed and implemented ConsultChain, an architecture introducing three novel mechanisms: (1) a five-tier model cascade with progressive context distillation, in which each tier compresses context before escalation rather than forwarding raw input; (2) knowledge crystallization, a self-healing persistent knowledge store modeled on physical crystal formation, with nucleation, growth, and fracture phases; and (3) synaptic pruning, an activity-driven memory management system inspired by adolescent neural development that promotes frequently used pathways and eliminates idle ones. The system was deployed on a Raspberry Pi 5 orchestrating a 10-model fleet accessed via cloud APIs, and was evaluated on simulated workloads of 50 requests per day.

Results: At steady state, the system achieves a 98.5% token cost reduction ($1.74/month versus $112.50/month for a single-model, full-context baseline) on a fleet spanning a 227× price differential ($0.11/M to $25/M tokens). Costs compound downward over time: Tier 0 resolution rates rise from 40% at week 1 to 95% at week 24 as the knowledge lattice matures. The orchestration layer runs in under 1.2 GB of RAM and requires no local GPU.

Conclusions: Progressive context distillation, combined with self-healing knowledge storage and activity-driven memory pruning, enables cost reductions that compound over time rather than remaining static. The approach is feasible on edge hardware and generalizes to any heterogeneous model fleet. The implementation is released as open source.
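The cascade-with-distillation idea from the Methods section can be illustrated with a minimal sketch. This is not the ConsultChain implementation: the tier names, prices, handler/distiller callables, and the word-count token estimate are all illustrative assumptions. Each tier either resolves the request or compresses the context before escalating to the next, more expensive tier, so stronger models see fewer tokens.

```python
# Hypothetical sketch of a model cascade with progressive context distillation.
# Tier names, prices, and the whitespace token estimate are illustrative only.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Tier:
    name: str
    price_per_mtok: float                    # USD per million tokens
    handle: Callable[[str], Optional[str]]   # returns an answer, or None on failure
    distill: Callable[[str], str]            # compresses context before escalation

def cascade(context: str, tiers: list[Tier]) -> tuple[str, float]:
    """Try tiers cheapest-first; on failure, distill context and escalate."""
    cost = 0.0
    for tier in tiers:
        tokens = len(context.split())        # crude stand-in for a real tokenizer
        cost += tokens / 1e6 * tier.price_per_mtok
        answer = tier.handle(context)
        if answer is not None:
            return answer, cost
        context = tier.distill(context)      # shrink context before the next tier
    raise RuntimeError("no tier resolved the request")

# Toy two-tier fleet: the cheap tier always fails but halves the context,
# so the strong tier is billed for the distilled input only.
cheap = Tier("cheap", 0.11, lambda c: None, lambda c: " ".join(c.split()[:4]))
strong = Tier("strong", 25.0, lambda c: "answer", lambda c: c)
answer, cost = cascade("a b c d e f g h", [cheap, strong])
```

In this toy run, escalating the distilled 4-token context costs less than sending the full 8-token context straight to the strong model, which is the mechanism the abstract credits for the compounding savings.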