Efficient Prompt Compression on Edge Devices

Abstract

Large Language Models (LLMs) such as GPT-4, BERT, and DeBERTa are widely used for tasks such as question answering, text summarization, and reasoning across many domains [1]–[3]. However, running these models on small or low-power devices such as mobile phones and IoT systems is challenging because they demand large amounts of memory, processing power, and time when handling long text inputs (prompts). To address this problem, researchers have developed several prompt compression methods that shorten prompts while preserving their key meaning, including summarization-based [4], embedding-based [5], and graph-based reasoning techniques [6]. Recent methods such as LLMLingua-2 [15] and Prompt Compression with Context-Aware Sentence Encoding [20] have further improved compression quality while maintaining reasoning consistency and efficiency. Building on these works, this paper proposes an Efficient Prompt Compression on Edge Devices framework that integrates embedding retrieval, causal-temporal reasoning, and coherence validation into a single lightweight pipeline. The framework produces interpretable reasoning graphs that retain only the most important information, enabling faster and more efficient processing. Experiments on the CQR dataset demonstrate that the proposed model achieves high reasoning accuracy, measured by BLEU and F1, while significantly reducing computational cost, making it suitable for real-time deployment on low-power edge devices.
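To make the embedding-retrieval step concrete, the following is a minimal sketch of embedding-based prompt compression under stated assumptions: it scores each prompt sentence by cosine similarity to the query and keeps the top-scoring sentences in their original order. The sentence-transformers model, the function name compress_prompt, and the keep_ratio parameter are illustrative assumptions, not details taken from the paper; the paper's full framework additionally applies causal-temporal reasoning and coherence validation, which this sketch omits.

    # Illustrative sketch only, not the paper's implementation.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    # A lightweight encoder chosen here for edge-friendliness (assumption).
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def compress_prompt(sentences: list[str], query: str, keep_ratio: float = 0.5) -> str:
        """Keep the sentences most relevant to the query, preserving order."""
        # Encode sentences and query together; normalized embeddings make
        # dot products equal to cosine similarities.
        emb = model.encode(sentences + [query], normalize_embeddings=True)
        sent_emb, query_emb = emb[:-1], emb[-1]
        scores = sent_emb @ query_emb
        # Retain the top-k sentences, then restore document order.
        k = max(1, int(len(sentences) * keep_ratio))
        keep = sorted(np.argsort(scores)[-k:])
        return " ".join(sentences[i] for i in keep)

    # Example usage: keep roughly 40% of a long prompt.
    # compressed = compress_prompt(sentences, query="What caused the delay?", keep_ratio=0.4)

This kind of retrieval-then-select design keeps the compression step cheap enough for on-device use, since only one forward pass of a small encoder is needed per prompt.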
