Speech Recognition and Synthesis Models and Platforms for the Kazakh Language

Aidana Karibayeva
Vladislav Karyukin
Balzhan Abduali
Dina Amirova

Abstract

With the rapid development of artificial intelligence and machine learning technologies, automatic speech recognition (ASR) and text-to-speech (TTS) have become key components of the digital transformation of society. The Kazakh language, as a representative of the Turkic language family, remains a low-resource language with limited audio corpora, language models, and high-quality speech synthesis systems. This study provides a comprehensive analysis of existing speech recognition and synthesis models, emphasizing their applicability and adaptation to the Kazakh language. Special attention is given to linguistic and technical barriers, including the agglutinative structure, rich vowel system, and phonemic variability. Both open-source and commercial solutions were evaluated, including Whisper, GPT-4 Transcribe, ElevenLabs, OpenAI TTS, Voiser, KazakhTTS2, and TurkicTTS. Speech recognition systems were assessed using BLEU, WER, TER, chrF, and COMET, while speech synthesis was evaluated with MCD, PESQ, STOI, and DNSMOS, thus covering both lexical–semantic and acoustic–perceptual characteristics. The results demonstrate that, for speech-to-text (STT), the strongest performance was achieved by Soyle on domain-specific data (BLEU 74.93, WER 18.61), while Voiser showed balanced accuracy (WER 40.65–37.11, chrF 80.88–84.51) and GPT-4 Transcribe achieved robust semantic preservation (COMET up to 1.02). In contrast, Whisper performed weakest (WER 77.10, BLEU 13.22), requiring further adaptation for Kazakh. For text-to-speech (TTS), KazakhTTS2 delivered the most natural perceptual quality (DNSMOS 8.79–8.96), while OpenAI TTS achieved the best spectral accuracy (MCD 123.44–117.11, PESQ 1.14). TurkicTTS offered reliable intelligibility (STOI 0.15, PESQ 1.16), and ElevenLabs produced natural but less spectrally accurate speech.
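As an illustration of how the word error rate (WER) values quoted above are typically obtained, the following is a minimal sketch that assumes whitespace tokenization and no text normalization. The abstract does not specify the authors' tokenization, normalization, or tooling, so the function name `word_error_rate` and the example sentences are hypothetical and the numbers it produces are not comparable to the reported scores.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a fraction: word-level edit distance / reference length.

    Assumption: words are split on whitespace and compared exactly
    (no lowercasing, punctuation stripping, or other normalization).
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # Hypothetical example pair, not taken from the study's data.
    reference = "мен мектепке бардым"
    hypothesis = "мен мектепке барды"
    print(f"WER = {100 * word_error_rate(reference, hypothesis):.2f}%")
```

Multiplying the returned fraction by 100 gives a percentage on the same scale as figures such as WER 18.61. Because agglutinative languages like Kazakh pack many morphemes into one word, a single wrong suffix counts as a full word error here, which is one reason character-level metrics such as chrF are often reported alongside WER.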

Version published to 10.3390/info16100879
Oct 10, 2025
Version published to 10.20944/preprints202507.2282.v1
Jul 28, 2025

Related articles

Operationalizing shared phonetic space in bilingual speech: A quantitative proof of concept for the Revised Speech Learning Model

This article has 1 author:
1. Alexandre Menezes Barroso
This article has no evaluations. Latest version: Apr 10, 2026

Pertsch: A Corpus of Persian and German Based on Different Speech Elicitation Tasks

This article has 1 author:
1. Neda Mousavi
This article has no evaluations. Latest version: May 11, 2026

Sound and meaning: On the duration of Japanese homophones

This article has 2 authors:
1. Motoki Saito
2. Ruben van de Vijver
This article has no evaluations. Latest version: Apr 14, 2026
