A Comparative Analysis of Chain-of-Thought Distillation from Gemini 3 to Legacy (Flan-T5) and Modern (Gemma) SLMs for Domain-Specific Classification
Abstract
Large Language Models (LLMs) such as Gemini 3 demonstrate strong multi-step reasoning, but their memory footprint and inference latency limit their suitability for real‑time, edge‑deployed financial services. Small Language Models (SLMs) enable lower-cost deployment, yet standard supervised fine‑tuning frequently fails to capture fine‑grained intent boundaries in customer support taxonomies. A comparative analysis is conducted of a distillation-by-synthesis approach that transfers chain‑of‑thought (CoT) supervision from a Teacher LLM (Gemini 3) into two Student architectures: a legacy encoder–decoder model (Flan‑T5 Base, 250M parameters) and a modern decoder‑only model (Gemma 2B). A reasoning‑augmented training set is synthesized on Banking77 by prompting the Teacher to produce intent labels together with short, structured justifications that highlight discriminative cues (for example, separating card_arrival from card_delivery_estimate). Student models are fine‑tuned to generate both an intent label and an aligned rationale. Evaluation covers three dimensions: (1) intent accuracy, (2) reasoning fidelity measured through rubric‑based label–rationale consistency, and (3) inference latency under batch‑1 serving. Results indicate that Gemma 2B yields the strongest accuracy and the most nuanced explanations, while Flan‑T5 Base delivers a favorable deployment trade‑off, maintaining competitive accuracy with substantially lower memory demand and latency. The analysis clarifies how architectural bias (encoder–decoder stability versus decoder‑only generation capacity) interacts with CoT distillation, providing guidance for low‑latency intent classifiers in compliance‑sensitive banking environments.
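The reasoning-augmented training pairs described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name, field names, and prompt wording are assumptions, and only the overall shape (utterance in, Teacher-produced label plus rationale as the generation target) follows the abstract.

```python
# Hypothetical sketch of packing one Banking77 example into a
# (input, target) pair for CoT distillation. The Student is trained to
# emit the intent label followed by an aligned rationale, so that
# label-rationale consistency can later be scored against a rubric.

def build_training_pair(utterance: str, teacher_label: str,
                        teacher_rationale: str) -> dict:
    """Combine an utterance with Teacher-generated supervision."""
    source = f"Classify the banking intent: {utterance}"
    # Label first, then the justification that highlights the
    # discriminative cue (e.g. delivery *estimate* vs. arrival status).
    target = f"Intent: {teacher_label}\nRationale: {teacher_rationale}"
    return {"input": source, "target": target}

pair = build_training_pair(
    "My new card still hasn't shown up, how long does delivery take?",
    "card_delivery_estimate",
    "The customer asks about expected delivery time, "
    "not whether a dispatched card has arrived.",
)
```

Pairs of this shape could be fed to either Student: as encoder input and decoder target for Flan-T5 Base, or concatenated into a single prompt-plus-completion sequence for Gemma 2B.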