A Uyghur–Chinese parallel dataset of proverbs
Abstract
Uyghur-Chinese paired resources that couple brief, culturally situated expressions with matched speech remain scarce, yet they are valuable for low-resource machine translation and speech-enabled NLP. We introduce UyZh-FolkSpeech, a manually curated Uyghur-Chinese dataset covering proverbs, frequently used daily phrases, and common words or short phrases, collected entirely in-house by the author team through elicitation, transcription, and bilingual alignment. The release provides 1,984 text items in total, comprising 953 short-sentence pairs and 1,031 word/short-phrase entries, each assigned an immutable identifier (UYZH-S-* for sentences; UYZH-W-* for words/phrases). For every item, we release four native-speaker recordings (S01–S04; two male and two female), yielding 7,936 linked audio clips. The full audio collection has a total duration of 08:28:17 (30,497.196 seconds, approximately 8.47 hours) and a mean clip duration of 3.84 seconds. All audio is distributed as M4A files encoded with AAC-LC at 48 kHz, mono, with a target bitrate of approximately 64 kbps, and is linked to the text records via a manifest that includes the required technical metadata. The package further includes recommended train/validation/test splits (including a fixed eval50 list) and optional scripts and configs to reproduce the provided fine-tuning example. The dataset (text, metadata, and audio) is released under CC BY 4.0 and is version-pinned to the GitHub release tag release-2026-01-25, alongside the corresponding Hugging Face dataset page.
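The counts and durations quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch rather than part of the release: the constant names and the `summarize` function are hypothetical, and the only inputs are the figures stated above.

```python
# Figures quoted in the abstract; the names below are illustrative only.
SENTENCE_PAIRS = 953        # UYZH-S-* items
WORD_ENTRIES = 1031         # UYZH-W-* items
SPEAKERS = 4                # S01-S04
TOTAL_SECONDS = 30_497.196  # total audio duration in seconds


def summarize():
    """Derive the aggregate statistics from the per-category figures."""
    items = SENTENCE_PAIRS + WORD_ENTRIES
    clips = items * SPEAKERS  # one recording per item per speaker

    # Convert the whole-second part of the duration to HH:MM:SS.
    hours, remainder = divmod(int(TOTAL_SECONDS), 3600)
    minutes, seconds = divmod(remainder, 60)

    return {
        "items": items,
        "clips": clips,
        "hms": f"{hours:02d}:{minutes:02d}:{seconds:02d}",
        "hours": round(TOTAL_SECONDS / 3600, 2),
        "mean_clip_seconds": round(TOTAL_SECONDS / clips, 2),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(f"{key}: {value}")
```

Under these inputs the derived values agree with the reported totals: 1,984 items, 7,936 clips, 08:28:17, about 8.47 hours, and a mean clip duration of 3.84 seconds.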