Matching Frontier Code Agents with Lightweight Models via Multi-Model Consultation

Abstract

Frontier performance on code benchmarks is assumed to require frontier models, but does it? We demonstrate that Claude Haiku 4.5, a lightweight model, achieves 74.6% on SWE-bench Verified, matching Claude Opus 4.5 (74.4%) at 62% lower cost per instance. Our approach combines two complementary strategies: a baseline single-agent policy and multi-model consultation via Polydev MCP, which queries GPT 5.2 Codex and Gemini 3 Flash Preview. The best single policy achieves 66.6% (95% CI: [62.3%, 70.6%]); taking the best result from either policy yields Resolve@2 of 74.6% (95% CI: [70.5%, 78.3%]). Key insight: the two approaches exhibit only 76% overlap in solved instances (Jaccard \(J = 0.76\)), with 24% of successes coming from one approach succeeding where the other failed. McNemar's test shows no systematic dominance (\(p = 0.29\)), indicating balanced bidirectional complementarity. Consultation helps most for multi-file changes (78.2%) and ambiguous requirements (84.7%), but can add noise for simple fixes. Our results suggest that inference-time scaling, through agent turns, extended thinking, and model diversity, can substitute for training-time model scale. Code, predictions, and reasoning trajectories for all 500 instances: https://github.com/backspacevenkat/polydev-swe-bench.
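The overlap statistics quoted above (Resolve@2, Jaccard overlap, and McNemar's test) are all functions of two per-instance success vectors, one per policy. The following is a minimal sketch of how they can be computed; it is not the authors' released evaluation harness, and the function name and inputs are illustrative assumptions.

```python
# Sketch only: computes Resolve@2, Jaccard overlap of solved sets, and an
# exact McNemar p-value from two boolean success vectors (one per policy).
from scipy.stats import binomtest

def overlap_stats(solved_a: list[bool], solved_b: list[bool]) -> dict:
    """Compare two policies evaluated on the same SWE-bench instances."""
    assert len(solved_a) == len(solved_b)
    n = len(solved_a)
    both   = sum(a and b for a, b in zip(solved_a, solved_b))
    only_a = sum(a and not b for a, b in zip(solved_a, solved_b))
    only_b = sum(b and not a for a, b in zip(solved_a, solved_b))
    union  = both + only_a + only_b
    return {
        "resolve_at_2": union / n,                  # best-of-two success rate
        "jaccard": both / union if union else 0.0,  # overlap of the solved sets
        # Exact McNemar test: discordant pairs tested against a fair coin;
        # a large p-value means neither policy systematically dominates.
        "mcnemar_p": binomtest(only_a, only_a + only_b, 0.5).pvalue,
    }
```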
