Optimizing Large Language Models for Efficiency: A Dual-Model Architecture with Dynamic Vocabulary Adjustment
Abstract
Large language models (LLMs) have revolutionized natural language processing but incur significant computational and energy costs. We propose a novel dual-model architecture that optimizes resource use by splitting processing between a lightweight model (Model B) and a full-capacity model (Model A). Model B handles frequent conversational patterns by mapping the 70% least-used input tokens to a single [RARE] token, with the mapping dynamically adjusted as usage patterns change. Critically, Model B also detects [RARE] tokens in its own output stream and top-k selections, routing these cases to Model A for nuanced handling. Model A processes all complex inputs and all [RARE] outputs. Simulations suggest savings of 40-55% in power consumption and 30-40% in server capacity, with potential for further gains through optimization. This approach offers a scalable, adaptive solution for deploying LLMs in resource-constrained environments, ensuring that both rare input tokens and rare output tokens are processed with full model capacity.
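The core mechanism can be illustrated with a minimal sketch. The helper names (`build_vocab_map`, `compress`, `needs_model_a`) and the frequency-ranking heuristic are illustrative assumptions, not the paper's implementation; the sketch only shows the idea of collapsing low-frequency tokens into a single [RARE] symbol and escalating to the full-capacity model when [RARE] surfaces in the output or top-k candidates.

```python
from collections import Counter

RARE = "[RARE]"

def build_vocab_map(token_counts: Counter, keep_fraction: float = 0.30) -> dict:
    """Keep only the top keep_fraction of tokens by frequency; map the
    rest (the 70% least-used tokens) to the single RARE symbol.
    Re-running this on fresh counts gives the dynamic vocabulary adjustment."""
    ranked = [tok for tok, _ in token_counts.most_common()]
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    return {tok: (tok if tok in kept else RARE) for tok in ranked}

def compress(tokens: list, vocab_map: dict) -> list:
    """Model B's input view: rare or unseen tokens collapse to RARE."""
    return [vocab_map.get(t, RARE) for t in tokens]

def needs_model_a(output_tokens: list, top_k_candidates: list) -> bool:
    """Route to Model A when RARE appears in Model B's output stream
    or among any step's top-k selections."""
    return RARE in output_tokens or any(RARE in ks for ks in top_k_candidates)

# Hypothetical usage: usage counts drive the vocabulary map.
counts = Counter({"the": 100, "cat": 50, "sat": 40, "on": 30,
                  "mat": 20, "quark": 2, "ephemeral": 1})
vmap = build_vocab_map(counts)          # keeps top 30% => {"the", "cat"}
print(compress(["the", "cat", "quark"], vmap))
print(needs_model_a(["the", RARE], []))
```

In a deployment, `needs_model_a` would gate a handoff: Model B serves the request end to end unless the check fires, in which case the original (uncompressed) context is forwarded to Model A.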