Quantization of a Llama Language Model for Improved Efficiency and Inference
Abstract
Despite their transformative potential, large language models (LLMs) such as Llama are difficult to deploy on devices with limited computational power because of their high computational requirements. This study explores quantization of the Llama model, a method that reduces memory footprint and model size for efficient deployment. We investigate several quantization techniques with the goal of achieving substantial model compression while maintaining acceptable performance. The study assesses the trade-off between efficiency and accuracy at various quantization levels, and examines how quantization affects inference speed and power consumption on the target devices. By quantizing the Llama model effectively and enabling deployment on resource-constrained platforms, this work seeks to democratize access to powerful AI tools, encouraging broader innovation and practical applications. A smaller model also lowers deployment costs and improves sustainability through reduced inference power usage. To quantize the Llama model, this research explores a number of technical approaches, evaluates performance trade-offs, and optimizes deployment for efficient hardware use. The objective is to demonstrate that the Llama model can be quantized and deployed successfully in resource-constrained environments. The results will contribute to making LLMs more accessible and more efficient.
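As a concrete illustration of the kind of quantization discussed above (not a description of the study's actual pipeline), the sketch below loads a Llama checkpoint in 4-bit precision using Hugging Face Transformers with bitsandbytes, one common post-training quantization route. The model identifier, the NF4 quantization settings, and the availability of a CUDA GPU are all assumptions made for this example.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed setup: bitsandbytes installed and a CUDA-capable GPU available.
# 4-bit NF4 quantization with bfloat16 compute is one typical configuration.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Hypothetical checkpoint choice; any Llama-family model on the Hub would work.
model_id = "meta-llama/Llama-2-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Report the quantized model's memory footprint to compare against a
# full-precision load of the same checkpoint.
print(f"Quantized footprint: {model.get_memory_footprint() / 1e9:.2f} GB")

# Quick inference check to confirm the quantized model still generates text.
inputs = tokenizer("Quantization reduces model size by", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A sketch like this only covers weight-only 4-bit loading; evaluating the accuracy, latency, and power trade-offs at different bit widths, as the study proposes, would require benchmarking each configuration separately.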