Matching Frontier Code Agents with Lightweight Models via Multi-Model Consultation

Abstract

Frontier performance on code benchmarks is assumed to require frontier models, but does it? We demonstrate that Claude Haiku 4.5, a lightweight model, achieves 74.6% on SWE-bench Verified, matching Claude Opus 4.5 (74.4%) at 62% lower cost per instance. Our approach combines two complementary strategies: a baseline single-agent policy and multi-model consultation via Polydev MCP, which queries GPT 5.2 Codex and Gemini 3 Flash Preview. The best single policy achieves 66.6% (95% CI: [62.3%, 70.6%]); taking the best result from either policy yields Resolve@2 of 74.6% (95% CI: [70.5%, 78.3%]). Key insight: the two approaches exhibit only 76% overlap in solved instances (Jaccard \(J = 0.76\)), with 24% of successes coming from one approach succeeding where the other failed. McNemar's test shows no systematic dominance (\(p = 0.29\)), indicating balanced bidirectional complementarity. Consultation helps most for multi-file changes (78.2%) and ambiguous requirements (84.7%), but can add noise for simple fixes. Our results suggest that inference-time scaling, through agent turns, extended thinking, and model diversity, can substitute for training-time model scale. Code, predictions, and reasoning trajectories for all 500 instances: https://github.com/backspacevenkat/polydev-swe-bench.
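The overlap statistics quoted above (Resolve@2, Jaccard overlap, and McNemar's test) are all functions of two per-instance success vectors, one per policy. The following is a minimal sketch of how they can be computed; it is not the authors' released evaluation harness, and the function name and inputs are illustrative assumptions.

```python
# Sketch only: computes Resolve@2, Jaccard overlap of solved sets, and an
# exact McNemar p-value from two boolean success vectors (one per policy).
from scipy.stats import binomtest

def overlap_stats(solved_a: list[bool], solved_b: list[bool]) -> dict:
    """Compare two policies evaluated on the same SWE-bench instances."""
    assert len(solved_a) == len(solved_b)
    n = len(solved_a)
    both   = sum(a and b for a, b in zip(solved_a, solved_b))
    only_a = sum(a and not b for a, b in zip(solved_a, solved_b))
    only_b = sum(b and not a for a, b in zip(solved_a, solved_b))
    union  = both + only_a + only_b
    return {
        "resolve_at_2": union / n,                  # best-of-two success rate
        "jaccard": both / union if union else 0.0,  # overlap of the solved sets
        # Exact McNemar test: discordant pairs tested against a fair coin;
        # a large p-value means neither policy systematically dominates.
        "mcnemar_p": binomtest(only_a, only_a + only_b, 0.5).pvalue,
    }
```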
