MPFM-VC: A Voice Conversion Algorithm Based on Multi-Dimensional Perception Flow Matching
Abstract
Voice conversion (VC) transforms source speech into high-quality audio that resembles a target speaker’s voice while preserving the original linguistic content and prosodic patterns. In this study, we propose a voice conversion algorithm, Multi-Dimensional Perception Flow Matching (MPFM-VC). Unlike traditional approaches that generate waveforms directly, MPFM-VC models the evolutionary trajectory of mel spectrograms within a flow-matching framework and incorporates a multi-dimensional feature perception network to improve the stability and quality of speech synthesis. We also introduce a content perturbation method during training to improve the model’s generalization and reduce inference-time artifacts. To further increase speaker similarity, adversarial training on speaker embeddings is used to disentangle content from speaker identity representations, which improves the timbre consistency of the converted speech. Experimental results on both speech and singing voice conversion tasks show that MPFM-VC achieves performance competitive with existing state-of-the-art VC models on both subjective and objective metrics. The converted speech shows improved naturalness, clarity, and timbre fidelity, suggesting the potential effectiveness of the proposed approach.
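To make the flow-matching idea concrete, the following is a minimal sketch of a generic conditional flow-matching objective and an Euler sampler applied to a mel-spectrogram-shaped array. It is not the MPFM-VC implementation: the abstract does not specify the probability path, so a straight-line path from Gaussian noise to the target is assumed here, and the names `cfm_training_pair`, `cfm_loss`, `euler_sample`, and `predict_velocity` are hypothetical. Conditioning on content and speaker embeddings is omitted for brevity.

```python
import numpy as np

def cfm_training_pair(x1, rng):
    """Build one flow-matching training example from a target mel spectrogram x1.

    Assumption: a straight-line path x_t = (1 - t) * x0 + t * x1 from noise x0
    to data x1, whose velocity is the constant x1 - x0.
    """
    x0 = rng.standard_normal(x1.shape)   # noise sample with the same shape as the mel
    t = rng.uniform(0.0, 1.0)            # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1        # point on the path at time t
    u_t = x1 - x0                        # regression target: velocity along the path
    return x_t, t, u_t

def cfm_loss(predict_velocity, x1, rng):
    """Mean squared error between the predicted and target velocities."""
    x_t, t, u_t = cfm_training_pair(x1, rng)
    v = predict_velocity(x_t, t)         # hypothetical model: (x_t, t) -> velocity
    return float(np.mean((v - u_t) ** 2))

def euler_sample(predict_velocity, x0, n_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * predict_velocity(x, t)
    return x
```

In a real system `predict_velocity` would be a neural network that also receives the content and speaker conditions, and the sampled mel spectrogram would be passed to a vocoder to obtain a waveform.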