Using Dynamic Token Embedding Compression to Optimize Inference Process in Large Language Models

Abstract

Large-scale deep learning architectures, while transformative for language understanding and generation, impose substantial computational and memory demands during inference, often limiting their practical deployment in constrained environments. This work introduces Dynamic Token Embedding Compression (DTEC), a methodology that addresses these challenges through a selective token embedding mechanism that dynamically adjusts embedding dimensionality based on contextual relevance during inference. DTEC optimizes memory usage and inference time by applying high compression to low-relevance tokens while preserving full dimensionality for tokens deemed critical to the context, yielding significant gains in resource efficiency. Experimental results demonstrate that DTEC reduces inference time by 25.6% and memory consumption by 30.2% on average across various text lengths without compromising model accuracy or output quality. Moreover, DTEC lowers hallucination rates, improving model fidelity on tasks that require precision and reliability. With its adaptive token prioritization, DTEC offers an efficient framework for resource-limited environments and a promising approach to real-time, scalable LLM deployment.
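
The abstract does not specify implementation details, but the core idea of per-token, relevance-dependent compression can be illustrated with a minimal sketch. The snippet below assumes a simple heuristic relevance score (the L2 norm of each token embedding) and a pair of learned down/up projections; the class name `DynamicTokenCompressor`, the `keep_ratio` parameter, and the scoring rule are illustrative assumptions, not the authors' actual method or API.

```python
# Minimal sketch of per-token dynamic embedding compression.
# Assumptions (not from the paper): relevance = embedding L2 norm,
# compression via a learned linear bottleneck applied only to
# low-relevance tokens.
import torch
import torch.nn as nn


class DynamicTokenCompressor(nn.Module):
    def __init__(self, d_model: int, d_compressed: int, keep_ratio: float = 0.5):
        super().__init__()
        # Down- and up-projections form the bottleneck for low-relevance tokens.
        self.down = nn.Linear(d_model, d_compressed)
        self.up = nn.Linear(d_compressed, d_model)
        self.keep_ratio = keep_ratio  # fraction of tokens kept at full dimensionality

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_model)
        batch, seq_len, _ = embeddings.shape

        # Proxy for contextual relevance: per-token embedding norm.
        relevance = embeddings.norm(dim=-1)               # (batch, seq_len)
        k = max(1, int(self.keep_ratio * seq_len))
        top_indices = relevance.topk(k, dim=-1).indices   # high-relevance token positions

        keep_mask = torch.zeros(batch, seq_len, dtype=torch.bool, device=embeddings.device)
        keep_mask.scatter_(1, top_indices, True)

        # Low-relevance tokens pass through the bottleneck; high-relevance
        # tokens keep their original full-dimensional embeddings.
        compressed = self.up(self.down(embeddings))       # (batch, seq_len, d_model)
        return torch.where(keep_mask.unsqueeze(-1), embeddings, compressed)


# Example: compress half of the tokens in a toy batch.
x = torch.randn(2, 16, 768)
module = DynamicTokenCompressor(d_model=768, d_compressed=128, keep_ratio=0.5)
y = module(x)
print(y.shape)  # torch.Size([2, 16, 768])
```

In a real deployment the memory savings would come from storing only the compressed representations for low-relevance tokens rather than materializing both tensors as this toy example does; the sketch only shows the routing logic of selective, relevance-based compression.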