Accelerating a Small Language Model via Quantization: A GPT-4-Guided Approach for Low-Resource Story Completion

Abstract

This paper introduces Story Completer, an efficient engine for real-time children's story completion that extends the foundational work of the TinyStories project. While TinyStories demonstrated that small models (< 10M parameters) can generate coherent narratives on simplified data, our work addresses the challenge of deploying high-quality, context-aware generative models on resource-constrained hardware. The core innovation is a hybrid architecture that combines the rich semantic knowledge of a large language model with the computational efficiency of a small one: we integrate pre-computed vectors from OpenAI's text-embedding-ada-002 model into a compact, 12-million-parameter decoder-only transformer, effectively distilling the contextual understanding of a massive model into a lightweight system. The model is trained from scratch with a strategy tailored specifically to storytelling.

A key contribution of this project is the optimization of the trained model for practical deployment on consumer-grade hardware, including low-end PCs and CPU-only machines. After initial training, we applied post-training quantization, converting the model's weights from 16-bit floating-point precision (FP16) to 8-bit unsigned integers (uint8). This optimization yielded significant performance gains without noticeable degradation in narrative quality.

Comparative analysis between the full-precision and quantized models demonstrates the effectiveness of this approach. Quantization reduced the model size by 49.5%, from 1546.04 MB to 780.03 MB, and cut average inference time by 55.4%, from 21.701 s to 9.671 s (a 2.24x speedup). This project provides strong evidence for an efficient paradigm in model design: the distilled intelligence of larger models, combined with optimization techniques such as quantization, can be leveraged to create smaller, faster, and highly capable specialized systems.
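The abstract does not specify how the pre-computed embeddings are fused into the small transformer. The PyTorch sketch below shows one plausible design, projecting the 1536-dimensional text-embedding-ada-002 vector of the story prompt and prepending it as a soft prefix token; the class name and every hyperparameter here (vocabulary size, width, depth) are illustrative assumptions chosen only to land near the reported 12 million parameters, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class EmbeddingConditionedDecoder(nn.Module):
    """Minimal sketch: condition a small decoder-only transformer on a
    pre-computed 1536-d sentence embedding by prepending it as a soft
    prefix token. All sizes are assumptions, not the paper's config."""

    def __init__(self, vocab_size=8192, d_model=256, n_layers=8, n_heads=8,
                 ada_dim=1536, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)   # +1 for the prefix slot
        self.ada_proj = nn.Linear(ada_dim, d_model)         # distilled context vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, ada_vec):
        B, T = tokens.shape
        prefix = self.ada_proj(ada_vec).unsqueeze(1)        # (B, 1, d_model)
        x = torch.cat([prefix, self.tok_emb(tokens)], dim=1)
        x = x + self.pos_emb(torch.arange(T + 1, device=tokens.device))
        # Causal mask so each position only attends to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x[:, 1:])                          # logits for story tokens

# Example forward pass with random inputs (batch of 2, 32 story tokens).
model = EmbeddingConditionedDecoder()
tokens = torch.randint(0, 8192, (2, 32))
ada_vec = torch.randn(2, 1536)              # stands in for a pre-computed embedding
logits = model(tokens, ada_vec)             # shape: (2, 32, 8192)
```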
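Post-training quantization from FP16 to uint8 is commonly implemented as an affine mapping with a scale and zero point per tensor. The NumPy sketch below illustrates that baseline; the paper does not state its exact scheme (per-tensor vs. per-channel granularity, calibration data, which layers are quantized), so treat this as an assumed reference rather than the authors' implementation.

```python
import numpy as np

def quantize_uint8(w_fp16: np.ndarray):
    """Affine (asymmetric) post-training quantization of one weight tensor
    from FP16 to uint8, using a single per-tensor scale and zero point."""
    w = w_fp16.astype(np.float32)            # compute the scale in FP32 for stability
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # map the observed range onto [0, 255]
    zero_point = round(-w_min / scale)       # integer that represents real value 0
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate real-valued weights for use at inference time."""
    return (q.astype(np.float32) - zero_point) * scale

# Quantize a random FP16 weight matrix and inspect error and storage savings.
w = np.random.randn(256, 256).astype(np.float16)
q, s, z = quantize_uint8(w)
w_hat = dequantize(q, s, z)
print("max abs error:", np.abs(w.astype(np.float32) - w_hat).max())
print("size ratio:   ", q.nbytes / w.nbytes)   # 0.5: one byte per weight vs. two
```

Storing one byte per weight instead of two predicts exactly a 50% size reduction; the reported 49.5% is consistent with a small overhead from quantization metadata (scales, zero points) or a few tensors kept at higher precision.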
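The reported percentages follow directly from the raw measurements quoted above, as a quick arithmetic check confirms:

```python
size_fp16, size_uint8 = 1546.04, 780.03   # model size in MB, from the abstract
t_fp16, t_uint8 = 21.701, 9.671           # average inference time in seconds

print(f"size reduction: {100 * (size_fp16 - size_uint8) / size_fp16:.1f}%")  # 49.5%
print(f"time reduction: {100 * (t_fp16 - t_uint8) / t_fp16:.1f}%")           # 55.4%
print(f"speedup factor: {t_fp16 / t_uint8:.2f}x")                            # 2.24x
```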
