LLFM-Voice: Emotionally Expressive Speech and Singing Voice Synthesis with Large Language Models via Flow Matching
Abstract
Although emotional speech synthesis has made significant progress, existing methods often struggle to generate naturally fluent emotional expression and face a trade-off between emotional richness and overall speech quality. We propose LLFM-Voice (Emotionally Expressive Speech and Singing Voice Synthesis with Large Language Models via Flow Matching), a unified framework that enhances emotional expressiveness in both speech and singing voice synthesis. Our method leverages the contextual modeling capabilities of large language models and incorporates emotional information through an autoregressive mechanism. To further capture musical nuance in singing, we design a fine-grained emotion generator that integrates vocal technique, tension, and pitch for precise control of expressive singing. In addition, we introduce a flow matching-based acoustic model that models the temporal evolution of mel spectrograms rather than predicting them directly, thereby mitigating artifacts introduced by conventional spectral modeling. Experiments show that LLFM-Voice outperforms baseline systems on multiple emotional-expressiveness metrics, producing speech with richer emotional content and singing voices with more natural melodic expression.
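To make the flow-matching idea concrete, the sketch below illustrates the standard conditional flow-matching training objective (linear interpolation path with a constant target velocity), applied to toy mel-spectrogram frames. This is a generic illustration of the technique, not the LLFM-Voice implementation; all shapes, names, and the choice of a linear path are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0, x1, t):
    """Interpolate noise x0 toward data x1 at time t and return
    the interpolant x_t and the constant target velocity x1 - x0
    (the regression target for a linear probability path)."""
    x_t = (1.0 - t)[:, None] * x0 + t[:, None] * x1
    v_target = x1 - x0
    return x_t, v_target

# Toy "mel frames": a batch of 4 frames with 80 mel bins (illustrative sizes).
x1 = rng.normal(size=(4, 80))   # data (mel-spectrogram frames)
x0 = rng.normal(size=(4, 80))   # Gaussian noise samples
t = rng.uniform(size=4)         # per-example time in [0, 1]

x_t, v = flow_matching_targets(x0, x1, t)

# In training, a conditional network v_theta(x_t, t, condition) is
# regressed onto v with an MSE loss; at inference, integrating
# dx/dt = v_theta from t = 0 to 1 transports noise into a mel
# spectrogram, which avoids predicting the spectrogram in one shot.
loss_for_true_velocity = np.mean((v - (x1 - x0)) ** 2)  # 0 for the exact target
```

The key point of the example is that the acoustic model learns a time-dependent velocity field over mel frames, so generation becomes an ODE integration from noise rather than a direct spectral regression.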