A Uyghur–Chinese parallel dataset of proverbs

Abstract

Uyghur-Chinese paired resources that couple brief, culturally situated expressions with matched speech remain scarce, yet they are valuable for low-resource machine translation and speech-enabled NLP. We introduce UyZh-FolkSpeech, a manually curated Uyghur-Chinese dataset covering proverbs, frequently used daily phrases, and common words or short phrases, collected entirely in-house by the author team through elicitation, transcription, and bilingual alignment. The release provides 1,984 text items in total, comprising 953 short-sentence pairs and 1,031 word/short-phrase entries, each assigned an immutable identifier (UYZH-S-* for sentences; UYZH-W-* for words/phrases). For every item, we release four native-speaker recordings (S01–S04; two male and two female), yielding 7,936 linked audio clips. The full audio collection has a total duration of 08:28:17 (30,497.196 seconds, approximately 8.47 hours) and a mean clip duration of 3.84 seconds. All audio is distributed as M4A files encoded with AAC-LC at 48 kHz, mono, with a target bitrate of approximately 64 kbps, and is linked to text records via a manifest that includes the required technical metadata. The package further includes recommended train/validation/test splits (including a fixed eval50 list) and optional scripts and configs to reproduce the provided fine-tuning example. The dataset (text, metadata, and audio) is released under CC BY 4.0 and is version-pinned to the GitHub release tag release-2026-01-25, alongside the corresponding Hugging Face dataset page.
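The counts and durations quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch in Python that uses only the figures stated above; the constant and function names are illustrative assumptions and are not part of the dataset release or its scripts.

```python
# Figures quoted in the abstract (names are illustrative, not from the release).
SENTENCE_PAIRS = 953          # UYZH-S-* items
WORD_ENTRIES = 1_031          # UYZH-W-* items
SPEAKERS_PER_ITEM = 4         # S01-S04
TOTAL_SECONDS = 30_497.196    # total audio duration


def to_hms(seconds: float) -> str:
    """Format a duration as HH:MM:SS, truncating fractional seconds."""
    whole = int(seconds)
    hours, remainder = divmod(whole, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


total_items = SENTENCE_PAIRS + WORD_ENTRIES
total_clips = total_items * SPEAKERS_PER_ITEM
total_hours = TOTAL_SECONDS / 3600
mean_clip_seconds = TOTAL_SECONDS / total_clips

print(total_items)                   # 1984
print(total_clips)                   # 7936
print(to_hms(TOTAL_SECONDS))         # 08:28:17
print(round(total_hours, 2))         # 8.47
print(round(mean_clip_seconds, 2))   # 3.84
```

Under these inputs the derived values agree with the reported totals: 1,984 items, 7,936 clips, 08:28:17, roughly 8.47 hours, and a mean of about 3.84 seconds per clip.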
