Efficient Compression of Large Language Models: A Case Study on Llama 2 with 13B Parameters

Abstract

Efficient compression of large language models is important for improving computational efficiency and reducing the memory required to deploy them in resource-limited environments. Employing a combination of advanced compression techniques (pruning, quantization, and knowledge distillation), this investigation explores the viability of reducing model size while maintaining high performance. Pruning removes non-critical parameters, significantly lowering the computational load. Quantization decreases the numerical precision of the model's weights and activations, reducing the memory footprint without substantially affecting the model's ability to process data effectively. Knowledge distillation trains a smaller, more compact model to replicate the predictive behavior of the larger model, ensuring minimal loss of capability. The results demonstrate that applying these techniques not only considerably increases the speed of data processing but also reduces operational costs and energy consumption, thereby supporting the deployment of powerful AI models in a wider variety of settings. The analysis confirms that while there is a slight reduction in accuracy and F1-score, the trade-offs are justifiable given the gains in processing efficiency and resource utilization. By carefully balancing compression intensity against model performance, tailored solutions can be developed to meet specific operational needs, expanding the practical applications of large language models in real-world scenarios. The outcomes underscore the potential of model compression technologies to advance the field of artificial intelligence by enabling more sustainable and efficient AI deployments across diverse computational environments.
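
The abstract does not specify how pruning was applied; as a rough illustration of the idea, a minimal sketch of unstructured magnitude pruning on the Linear layers of a PyTorch model might look like the following (the function name, the use of torch.nn.utils.prune, and the 30% ratio are illustrative assumptions, not details taken from the paper).

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.3) -> nn.Module:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Mask the `amount` fraction of weights with the smallest absolute value.
            prune.l1_unstructured(module, name="weight", amount=amount)
            # Fold the mask into the weight tensor so the pruning becomes permanent.
            prune.remove(module, "weight")
    return model
```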
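
Similarly, one common way to realize the quantization step described above is post-training dynamic quantization, which stores Linear-layer weights as 8-bit integers and quantizes activations on the fly. The sketch below assumes PyTorch's built-in quantize_dynamic and is not necessarily the scheme used in the study.

```python
import torch
from torch.ao.quantization import quantize_dynamic

def quantize_linear_layers(model: torch.nn.Module) -> torch.nn.Module:
    """Convert Linear-layer weights to int8; activations are quantized at runtime."""
    return quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
```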
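
For the knowledge-distillation step, a standard formulation combines a temperature-softened KL-divergence term against the teacher's logits with ordinary cross-entropy on the ground-truth labels; the temperature and weighting factor alpha below are illustrative assumptions rather than values reported in the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL divergence (teacher -> student) with hard-label cross-entropy."""
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    kd_term = F.kl_div(soft_student, soft_teacher, reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term
```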
