Fine-Tuning Large Language Models for Kazakh Text Simplification

Abstract

This work addresses the task of text simplification for Kazakh, a morphologically rich and low-resource language. We propose KazSim, a fine-tuned simplification model based on large language models (LLMs), using both the multilingual Llama series and Qwen2 backbones. To support model training, we construct a parallel simplification dataset using a Kazakh sentence complexity identification pipeline that selects complex sentences from raw corpora with a heuristic approach. As baselines, we include standard Seq2Seq models, Kazakh domain-specific large language models, and general-purpose instruction-following models in a zero-shot setup. Evaluation is performed on both an automatically constructed test set and a semi-manually created benchmark, using SARI, BLEU, ROUGE, and BERTScore. Results show that KazSim consistently outperforms all baselines, including domain-specific LLMs and zero-shot models, achieving strong simplification quality while preserving meaning and controlling output length. We also examine the impact of prompt language on generation quality, comparing English and Kazakh instructions. While performance remains consistent overall, models tend to produce slightly better outputs when prompted in Kazakh, particularly in zero-shot and domain-specific settings.
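The abstract names the evaluation metrics but not their implementations. A minimal sketch of scoring system outputs with SARI, BLEU, ROUGE, and BERTScore is shown below using the Hugging Face `evaluate` library; the placeholder sentences and the choice of a multilingual BERTScore backbone for Kazakh are assumptions for illustration, not details taken from the paper.

```python
# Sketch: scoring simplification outputs with the four metrics named in the
# abstract, via the Hugging Face `evaluate` library. Sentences are placeholders,
# not examples from the KazSim dataset.
import evaluate

sources     = ["Complex Kazakh source sentence ..."]
predictions = ["Model-simplified output sentence ..."]
references  = [["Reference simplification ..."]]  # one or more references per source

sari = evaluate.load("sari")
bleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# SARI compares outputs against both the source and the references.
print(sari.compute(sources=sources, predictions=predictions, references=references))
# BLEU and ROUGE measure n-gram overlap with the references.
print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=[r[0] for r in references]))
# BERTScore with lang="kk" falls back to a multilingual encoder; the exact
# backbone used in the paper is not specified in the abstract.
print(bertscore.compute(predictions=predictions,
                        references=[r[0] for r in references], lang="kk"))
```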
