Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Adarsh MS
Jithin VG
Ditto PS

Abstract

Large language models (LLMs) perform exceptionally well across a range of natural language processing tasks, but deploying them carries a high computational and financial cost. Smaller language models (SLMs), which can run on lower-cost edge devices, struggle to match the performance of their larger counterparts. This paper presents a hybrid inference approach that leverages the strengths of both model types while minimizing reliance on costly cloud-based LLMs. Unlike existing methods that route entire queries to either an SLM or a cloud LLM, our approach introduces a reward-based mechanism that decides dynamically, during token generation, when to involve the cloud LLM. Specifically, each token predicted by the SLM is assigned a reward score, and the cloud LLM is consulted for assistance in the next token prediction only when this score falls below a set threshold. This reduces traffic to the cloud LLM, thereby lowering costs, and the reward score threshold gives flexible control over response quality. Experimental results demonstrate that our approach significantly reduces cloud LLM usage with minimal impact on overall response quality, offering a cost-effective solution for deploying high-performance language models.
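
The token-level routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function `hybrid_generate`, its parameters, and the toy stand-in models are hypothetical names introduced here. The abstract does not specify how the reward score is computed, so the sketch takes the reward as a caller-supplied function. It also assumes one particular reading of the mechanism, namely that the cloud LLM's prediction replaces the SLM's low-scoring candidate at the same position.

```python
from typing import Callable, List, Tuple

Token = str
NextToken = Callable[[List[Token]], Token]        # context -> next token
Reward = Callable[[List[Token], Token], float]    # (context, candidate) -> score


def hybrid_generate(
    prompt: List[Token],
    slm_next: NextToken,
    llm_next: NextToken,
    reward: Reward,
    threshold: float,
    max_new_tokens: int = 32,
    eos: Token = "<eos>",
) -> Tuple[List[Token], int]:
    """Generate tokens with the SLM, consulting the cloud LLM only when the
    SLM candidate's reward falls below `threshold`.

    Returns the generated tokens and the number of cloud LLM calls.
    """
    context = list(prompt)
    cloud_calls = 0
    for _ in range(max_new_tokens):
        candidate = slm_next(context)
        # Assumption: a low-scoring candidate is replaced by the LLM's token.
        if reward(context, candidate) < threshold:
            candidate = llm_next(context)
            cloud_calls += 1
        context.append(candidate)
        if candidate == eos:
            break
    return context[len(prompt):], cloud_calls


# Toy stand-ins (hypothetical): the "LLM" always emits the target sequence,
# the "SLM" gets one token wrong, and the reward flags that mistake.
TARGET = ["the", "cat", "sat", "<eos>"]


def toy_llm(context: List[Token]) -> Token:
    return TARGET[len(context)]


def toy_slm(context: List[Token]) -> Token:
    token = TARGET[len(context)]
    return "dog" if token == "cat" else token


def toy_reward(context: List[Token], token: Token) -> float:
    return 1.0 if token == TARGET[len(context)] else 0.0


if __name__ == "__main__":
    for threshold in (0.5, 0.0):
        tokens, calls = hybrid_generate([], toy_slm, toy_llm, toy_reward, threshold)
        print(threshold, tokens, calls)
```

Raising `threshold` sends more tokens to the cloud LLM and lowering it keeps more generation on the SLM, which is the cost and quality trade-off the abstract describes. With a threshold of 0.0 the toy reward never falls below it, so the cloud LLM is never called.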

Version published to 10.20944/preprints202409.1208.v1
Sep 17, 2024

