Using Dynamic Token Embedding Compression to Optimize Inference Process in Large Language Models
Abstract
Large-scale deep learning architectures, while transformative for language understanding and generation, impose substantial computational and memory demands during inference, often limiting their practical deployment in constrained environments. This work introduces Dynamic Token Embedding Compression (DTEC), a methodology that addresses these challenges through a selective token embedding mechanism that dynamically adjusts embedding dimensionality based on contextual relevance during inference. DTEC optimizes memory usage and inference time by applying high compression to low-relevance tokens while preserving full dimensionality for tokens deemed critical to the context, yielding significant gains in resource efficiency. Experimental results show that DTEC reduces inference time by 25.6% and memory consumption by 30.2% on average across varying text lengths without compromising model accuracy or output quality. Moreover, DTEC lowers hallucination rates, improving model fidelity and strengthening its suitability for tasks that require precision and reliability. With its adaptive token prioritization, DTEC emerges as an efficient framework for resource-limited environments and a promising approach for real-time, scalable LLM deployment.
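To make the selective mechanism described above concrete, the sketch below shows one way relevance-gated embedding compression could be expressed in PyTorch. The relevance scorer, projection sizes, and threshold are illustrative assumptions for exposition only, not the paper's actual DTEC implementation, which is detailed in later sections.

```python
# Minimal sketch of relevance-gated token embedding compression (illustrative only).
# The linear scorer, down/up projections, and fixed threshold are assumptions;
# the abstract does not specify how DTEC measures relevance or compresses embeddings.
import torch
import torch.nn as nn


class RelevanceGatedCompressor(nn.Module):
    def __init__(self, d_model: int = 768, d_compressed: int = 192, threshold: float = 0.5):
        super().__init__()
        self.down = nn.Linear(d_model, d_compressed)  # lossy low-rank path for low-relevance tokens
        self.up = nn.Linear(d_compressed, d_model)    # restore shape for downstream layers
        self.scorer = nn.Linear(d_model, 1)           # hypothetical per-token relevance score
        self.threshold = threshold

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_model)
        relevance = torch.sigmoid(self.scorer(embeddings)).squeeze(-1)  # (batch, seq_len)
        compress_mask = (relevance < self.threshold).unsqueeze(-1)      # mark low-relevance tokens
        compressed = self.up(self.down(embeddings))                     # compressed representation
        # High-relevance tokens keep their full embeddings; the rest take the compressed path.
        return torch.where(compress_mask, compressed, embeddings)


if __name__ == "__main__":
    x = torch.randn(2, 16, 768)
    out = RelevanceGatedCompressor()(x)
    print(out.shape)  # torch.Size([2, 16, 768])
```

In this toy version the compressed tokens are projected back to the full width so the output tensor shape is unchanged; a production variant would more likely keep low-relevance tokens in the reduced dimension end to end to realize the memory savings the abstract reports.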