Accelerating a Small Language Model via Quantization: A GPT-4-Guided Approach for Low-Resource Story Completion

Abstract

This paper introduces Story Completer, an efficient engine for real-time children's story completion that extends the foundational work of the TinyStories project. While TinyStories demonstrated that small models (< 10M parameters) can generate coherent narratives on simplified data, our work addresses the challenge of deploying high-quality, context-aware generative models on resource-constrained hardware. The core innovation is a hybrid architecture that combines the rich semantic knowledge of a large language model with the computational efficiency of a small one: we integrate pre-computed vectors from OpenAI's text-embedding-ada-002 model into a compact, 12-million-parameter decoder-only transformer, effectively distilling the contextual understanding of a massive model into a lightweight system. The model is trained from scratch with a strategy tailored specifically to storytelling.

A key contribution of this project is the optimization of the trained model for practical deployment on consumer-grade hardware, including low-end PCs and CPU-only machines. After initial training, we applied post-training quantization, converting the model's weights from 16-bit floating-point precision (FP16) to 8-bit unsigned integers (uint8). This optimization yielded significant performance gains without noticeable degradation in narrative quality.

Comparative analysis between the full-precision and quantized models demonstrates the effectiveness of this approach. Quantization reduced the model size by 49.5%, from 1546.04 MB to 780.03 MB, and cut average inference time by 55.4%, from 21.701 s to 9.671 s (a 2.24x speedup). This project provides strong evidence for an efficient paradigm in model design: the distilled intelligence of larger models, combined with optimization techniques such as quantization, can be leveraged to create smaller, faster, and highly capable specialized systems.
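The abstract does not specify how the pre-computed embeddings are fused into the small transformer. The PyTorch sketch below shows one plausible design, projecting the 1536-dimensional text-embedding-ada-002 vector of the story prompt and prepending it as a soft prefix token; the class name and every hyperparameter here (vocabulary size, width, depth) are illustrative assumptions chosen only to land near the reported 12 million parameters, not the authors' actual configuration.

```python
import torch
import torch.nn as nn

class EmbeddingConditionedDecoder(nn.Module):
    """Minimal sketch: condition a small decoder-only transformer on a
    pre-computed 1536-d sentence embedding by prepending it as a soft
    prefix token. All sizes are assumptions, not the paper's config."""

    def __init__(self, vocab_size=8192, d_model=256, n_layers=8, n_heads=8,
                 ada_dim=1536, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len + 1, d_model)   # +1 for the prefix slot
        self.ada_proj = nn.Linear(ada_dim, d_model)         # distilled context vector
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, ada_vec):
        B, T = tokens.shape
        prefix = self.ada_proj(ada_vec).unsqueeze(1)        # (B, 1, d_model)
        x = torch.cat([prefix, self.tok_emb(tokens)], dim=1)
        x = x + self.pos_emb(torch.arange(T + 1, device=tokens.device))
        # Causal mask so each position only attends to itself and the past.
        mask = nn.Transformer.generate_square_subsequent_mask(T + 1).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x[:, 1:])                          # logits for story tokens

# Example forward pass with random inputs (batch of 2, 32 story tokens).
model = EmbeddingConditionedDecoder()
tokens = torch.randint(0, 8192, (2, 32))
ada_vec = torch.randn(2, 1536)              # stands in for a pre-computed embedding
logits = model(tokens, ada_vec)             # shape: (2, 32, 8192)
```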
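Post-training quantization from FP16 to uint8 is commonly implemented as an affine mapping with a scale and zero point per tensor. The NumPy sketch below illustrates that baseline; the paper does not state its exact scheme (per-tensor vs. per-channel granularity, calibration data, which layers are quantized), so treat this as an assumed reference rather than the authors' implementation.

```python
import numpy as np

def quantize_uint8(w_fp16: np.ndarray):
    """Affine (asymmetric) post-training quantization of one weight tensor
    from FP16 to uint8, using a single per-tensor scale and zero point."""
    w = w_fp16.astype(np.float32)            # compute the scale in FP32 for stability
    w_min, w_max = float(w.min()), float(w.max())
    scale = (w_max - w_min) / 255.0 or 1.0   # map the observed range onto [0, 255]
    zero_point = round(-w_min / scale)       # integer that represents real value 0
    q = np.clip(np.round(w / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate real-valued weights for use at inference time."""
    return (q.astype(np.float32) - zero_point) * scale

# Quantize a random FP16 weight matrix and inspect error and storage savings.
w = np.random.randn(256, 256).astype(np.float16)
q, s, z = quantize_uint8(w)
w_hat = dequantize(q, s, z)
print("max abs error:", np.abs(w.astype(np.float32) - w_hat).max())
print("size ratio:   ", q.nbytes / w.nbytes)   # 0.5: one byte per weight vs. two
```

Storing one byte per weight instead of two predicts exactly a 50% size reduction; the reported 49.5% is consistent with a small overhead from quantization metadata (scales, zero points) or a few tensors kept at higher precision.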
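The reported percentages follow directly from the raw measurements quoted above, as a quick arithmetic check confirms:

```python
size_fp16, size_uint8 = 1546.04, 780.03   # model size in MB, from the abstract
t_fp16, t_uint8 = 21.701, 9.671           # average inference time in seconds

print(f"size reduction: {100 * (size_fp16 - size_uint8) / size_fp16:.1f}%")  # 49.5%
print(f"time reduction: {100 * (t_fp16 - t_uint8) / t_fp16:.1f}%")           # 55.4%
print(f"speedup factor: {t_fp16 / t_uint8:.2f}x")                            # 2.24x
```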
