Efficient Prompt Compression on Edge Devices

Abstract

Large Language Models (LLMs) such as GPT-4, BERT, and DeBERTa are widely used for tasks such as question answering, text summarization, and reasoning across many domains [1]–[3]. However, running these models on small or low-power devices such as mobile phones and IoT systems is challenging because they demand large amounts of memory, processing power, and time when handling long text inputs (prompts). To address this problem, researchers have developed several prompt compression methods that shorten prompts while preserving their key meaning, including summarization-based [4], embedding-based [5], and graph-based reasoning techniques [6]. Recent methods such as LLMLingua-2 [15] and Prompt Compression with Context-Aware Sentence Encoding [20] have further improved compression quality while maintaining reasoning consistency and efficiency. Building on these works, this paper proposes an Efficient Prompt Compression on Edge Devices framework that integrates embedding retrieval, causal-temporal reasoning, and coherence validation into a single lightweight pipeline. The framework produces interpretable reasoning graphs that retain only the most important information, enabling faster and more efficient processing. Experiments on the CQR dataset demonstrate that the proposed model achieves high reasoning accuracy, measured by BLEU and F1, while significantly reducing computational cost, making it suitable for real-time deployment on low-power edge devices.
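To make the embedding-retrieval step concrete, the following is a minimal sketch of embedding-based prompt compression under stated assumptions: it scores each prompt sentence by cosine similarity to the query and keeps the top-scoring sentences in their original order. The sentence-transformers model, the function name compress_prompt, and the keep_ratio parameter are illustrative assumptions, not details taken from the paper; the paper's full framework additionally applies causal-temporal reasoning and coherence validation, which this sketch omits.

    # Illustrative sketch only, not the paper's implementation.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # A lightweight encoder chosen here for edge-friendliness (assumption).
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def compress_prompt(sentences: list[str], query: str, keep_ratio: float = 0.5) -> str:
        """Keep the sentences most relevant to the query, preserving order."""
        # Encode sentences and query together; normalized embeddings make
        # dot products equal to cosine similarities.
        emb = model.encode(sentences + [query], normalize_embeddings=True)
        sent_emb, query_emb = emb[:-1], emb[-1]
        scores = sent_emb @ query_emb
        # Retain the top-k sentences, then restore document order.
        k = max(1, int(len(sentences) * keep_ratio))
        keep = sorted(np.argsort(scores)[-k:])
        return " ".join(sentences[i] for i in keep)

    # Example usage: keep roughly 40% of a long prompt.
    # compressed = compress_prompt(sentences, query="What caused the delay?", keep_ratio=0.4)

This kind of retrieval-then-select design keeps the compression step cheap enough for on-device use, since only one forward pass of a small encoder is needed per prompt.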
