SoundTwin: A High-Similarity Fast Diffusion Autoregressive Speech Cloning Model Based on Local-Global Feature Fusion
Abstract
In recent years, significant advances in deep learning have propelled text-to-speech (TTS) technology forward. However, high-quality voice cloning under low-resource and low-latency conditions remains challenging. Traditional autoregressive models suffer from high inference latency, while emerging diffusion models, although capable of generating high-fidelity speech, incur substantial computational overhead due to their multi-step sampling process. To mitigate these limitations, we propose SoundTwin, a novel speech synthesis framework that integrates accelerated diffusion sampling with an autoregressive Transformer architecture, significantly improving synthesis efficiency without compromising speech naturalness. Furthermore, we design a Local-Global Squeeze-and-Excitation weighted Speaker embedding Network that efficiently extracts fine-grained timbre features from limited reference audio, enabling rapid speaker adaptation. The model takes target text, reference speech, and reference text as inputs, jointly modeling duration, pitch, and energy to generate high-quality mel-spectrograms. Experimental results show that our method achieves state-of-the-art speaker similarity and speech naturalness in zero-shot voice cloning tasks.
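To make the local-global Squeeze-and-Excitation weighting idea concrete, the PyTorch sketch below shows one plausible way such a speaker encoder could be structured: a small convolutional stack extracts frame-level (local) timbre cues, temporal pooling yields an utterance-level (global) summary, and an SE gate reweights the fused channels before projection to a fixed-size embedding. All module names, dimensions, and the concatenation-based fusion are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation gate: reweights channels using global context."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        squeeze = x.mean(dim=-1)               # squeeze: global average pool over time
        gate = self.fc(squeeze).unsqueeze(-1)  # excitation: per-channel weights in (0, 1)
        return x * gate                        # rescale channels by learned importance

class LocalGlobalSpeakerEncoder(nn.Module):
    """Hypothetical local-global fusion: convolutions capture fine-grained
    (local) timbre features, mean pooling captures utterance-level (global)
    ones, and an SE gate weights the concatenated result."""
    def __init__(self, n_mels: int = 80, hidden: int = 256, embed_dim: int = 192):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        self.se = SEBlock(hidden * 2)
        self.proj = nn.Linear(hidden * 2, embed_dim)

    def forward(self, mels: torch.Tensor) -> torch.Tensor:
        # mels: (batch, n_mels, time) from the reference audio
        local = self.local(mels)                    # (B, hidden, T) frame-level features
        global_ = local.mean(dim=-1, keepdim=True)  # (B, hidden, 1) utterance summary
        fused = torch.cat([local, global_.expand_as(local)], dim=1)
        weighted = self.se(fused)                   # SE-weighted local-global fusion
        return self.proj(weighted.mean(dim=-1))     # fixed-size speaker embedding

# Usage: embed a short reference clip (80 mel bins, 200 frames)
encoder = LocalGlobalSpeakerEncoder()
speaker_embedding = encoder(torch.randn(1, 80, 200))  # -> (1, 192)
```

Under this reading, the SE gate lets the encoder emphasize whichever fused channels carry the most speaker-discriminative information for a given reference clip, which is one way fine-grained timbre could be extracted from limited audio.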