BnVITS: A Voice Cloning Approach for Single Speaker Text-to-Speech
Abstract
Although significant progress has been made in voice cloning and text-to-speech (TTS) models, especially in generating natural-sounding speech, low-resource languages such as Bangla (Bn) remain nearly unexplored. Despite recent advancements, TTS systems for Bangla still encounter difficulties due to the language's intricate phonology and morphology. Furthermore, no previous work has been done on voice cloning for Bangla. To address this research gap, we present a voice cloning method that builds a Bangla TTS system using as little speech data as possible. Additionally, we introduce PYBANGLA, a text normalization tool created specifically for Bangla language processing. Voice cloning can be accomplished by fine-tuning top-performing TTS models with just a few target-speaker samples. We assess the system with both subjective and objective evaluation metrics, and the results show that our BnVITS model performs better than the earlier Bangla TTS model. This approach opens up new opportunities for individualized voice technology by paving the way for more data-efficient Bangla TTS approaches.