LLM-Skill Orchestration: Achieving 202/202 Subtask Completion via Rule-Augmented Multi-Model Collaboration in 50 Agentic Tasks
Abstract
LLM agents typically rely on a single model for multi-step tool-using tasks, creating a tension between the breadth of capability required and the limitations of any individual model. We introduce LLM-Skill Orchestration, a three-layer architecture in which: (1) a reasoning model generates orchestration rules from system constraints alone; (2) a planning model decomposes tasks into skill graphs with explicit dependencies; and (3) heterogeneous LLM-Skills, both pure-text and tool-equipped, execute in parallel through a shared context pool. We evaluate the system on 50 agentic tasks spanning five types (information retrieval, code construction, cross-system analysis, multi-step reasoning, compound decision-making), each with 4–6 binary checklist items, for a total of 202 items. The rule-augmented system (Hb) achieves 202/202 completion and an average quality score of 17.5/20 (LLM-as-Judge, σ=2.0), compared to 137/202 (68%) and 7.4/20 for the single-model baseline (A), and 166/202 (82%) and 13.7/20 for static-rule orchestration (C). Three findings from a 5-task pilot ablation study (used for relative comparisons only) shape our understanding: (i) same-model decomposition (D: 8/22) performs worse than no decomposition (A: 13/22), showing that model diversity, not parallelism, drives collaborative gains; (ii) rule-blind generation (Hb: 96/100) outperforms rule-informed generation (Hi: 76/100), suggesting that deductive reasoning from system invariants generalizes better than inductive learning from failure cases; (iii) across the full 50-task evaluation, 34 of 227 skills (15%) produced 0-byte output due to API anomalies, yet all were autonomously compensated for by the synthesis stage, an emergent architectural resilience not designed into the system.
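To make the three-layer design concrete, the sketch below shows one plausible way the layers could fit together; it is an illustration only, not the authors' implementation. All names (`Skill`, `call_model`, `run_orchestration`), the thread-pool executor, and the dictionary-based context pool are assumptions. Dependencies gate when a skill becomes ready, independent skills run in parallel, and every output lands in a shared context pool that downstream skills and the synthesis stage read from.

```python
# Illustrative sketch only; names and structure are hypothetical stand-ins
# for the three layers described in the abstract.
from dataclasses import dataclass, field
from concurrent.futures import ThreadPoolExecutor


@dataclass
class Skill:
    name: str
    model: str                                  # heterogeneous LLM backing this skill
    deps: list = field(default_factory=list)    # explicit dependencies in the skill graph
    uses_tools: bool = False                    # pure-text vs tool-equipped skill


def call_model(model: str, prompt: str) -> str:
    """Placeholder for an LLM call; swap in a real client."""
    return f"[{model}] output for: {prompt[:40]}"


def run_orchestration(task: str, rules: list[str], skills: list[Skill]) -> dict:
    """Execute skills in dependency order; independent skills run in parallel.
    Outputs accumulate in a shared context pool keyed by skill name."""
    context: dict[str, str] = {"task": task, "rules": "\n".join(rules)}
    remaining = {s.name: s for s in skills}
    with ThreadPoolExecutor() as pool:
        while remaining:
            # A skill is ready once all of its dependencies are in the context pool.
            ready = [s for s in remaining.values() if all(d in context for d in s.deps)]
            if not ready:
                raise RuntimeError("cyclic or unsatisfiable dependencies in skill graph")
            futures = {
                s.name: pool.submit(
                    call_model,
                    s.model,
                    context["rules"]
                    + "\n"
                    + "\n".join(context[d] for d in s.deps)
                    + "\n"
                    + task,
                )
                for s in ready
            }
            for name, fut in futures.items():
                # Empty (0-byte) outputs are stored as-is and left for the
                # synthesis stage to compensate for.
                context[name] = fut.result() or ""
                del remaining[name]
    return context


# Example usage with a three-skill graph:
skills = [
    Skill("retrieve", model="model-a"),
    Skill("analyze", model="model-b", deps=["retrieve"]),
    Skill("synthesize", model="model-c", deps=["retrieve", "analyze"]),
]
print(run_orchestration("summarize system logs", ["cite every source"], skills))
```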