Using Dynamic Token Embedding Compression to Optimize Inference Process in Large Language Models

Abstract

Large-scale deep learning architectures, while transformative for language understanding and generation, impose substantial computational and memory demands during inference, often limiting their practical deployment in constrained environments. This work introduces Dynamic Token Embedding Compression (DTEC), a methodology that addresses these challenges through a selective token embedding mechanism that dynamically adjusts embedding dimensionality based on contextual relevance during inference. DTEC optimizes memory usage and inference time by applying high compression to low-relevance tokens while preserving full dimensionality for tokens deemed critical to the context, yielding significant gains in resource efficiency. Experimental results demonstrate that DTEC reduces inference time by 25.6% and memory consumption by 30.2% on average across various text lengths without compromising model accuracy or output quality. Moreover, DTEC lowers hallucination rates, improving model fidelity on tasks that require precision and reliability. With its adaptive token prioritization, DTEC offers an efficient framework for resource-limited environments and a promising approach to real-time, scalable LLM deployment.
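
The abstract does not specify implementation details, but the core idea of per-token, relevance-dependent compression can be illustrated with a minimal sketch. The snippet below assumes a simple heuristic relevance score (the L2 norm of each token embedding) and a pair of learned down/up projections; the class name `DynamicTokenCompressor`, the `keep_ratio` parameter, and the scoring rule are illustrative assumptions, not the authors' actual method or API.

```python
# Minimal sketch of per-token dynamic embedding compression.
# Assumptions (not from the paper): relevance = embedding L2 norm,
# compression via a learned linear bottleneck applied only to
# low-relevance tokens.
import torch
import torch.nn as nn


class DynamicTokenCompressor(nn.Module):
    def __init__(self, d_model: int, d_compressed: int, keep_ratio: float = 0.5):
        super().__init__()
        # Down- and up-projections form the bottleneck for low-relevance tokens.
        self.down = nn.Linear(d_model, d_compressed)
        self.up = nn.Linear(d_compressed, d_model)
        self.keep_ratio = keep_ratio  # fraction of tokens kept at full dimensionality

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, seq_len, d_model)
        batch, seq_len, _ = embeddings.shape

        # Proxy for contextual relevance: per-token embedding norm.
        relevance = embeddings.norm(dim=-1)               # (batch, seq_len)
        k = max(1, int(self.keep_ratio * seq_len))
        top_indices = relevance.topk(k, dim=-1).indices   # high-relevance token positions

        keep_mask = torch.zeros(batch, seq_len, dtype=torch.bool, device=embeddings.device)
        keep_mask.scatter_(1, top_indices, True)

        # Low-relevance tokens pass through the bottleneck; high-relevance
        # tokens keep their original full-dimensional embeddings.
        compressed = self.up(self.down(embeddings))       # (batch, seq_len, d_model)
        return torch.where(keep_mask.unsqueeze(-1), embeddings, compressed)


# Example: compress half of the tokens in a toy batch.
x = torch.randn(2, 16, 768)
module = DynamicTokenCompressor(d_model=768, d_compressed=128, keep_ratio=0.5)
y = module(x)
print(y.shape)  # torch.Size([2, 16, 768])
```

In a real deployment the memory savings would come from storing only the compressed representations for low-relevance tokens rather than materializing both tensors as this toy example does; the sketch only shows the routing logic of selective, relevance-based compression.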