MPFM-VC: A Voice Conversion Algorithm based on Multi-Dimensional Perception Flow Matching
Abstract
Voice conversion (VC) transforms raw speech into high-quality audio that resembles a target speaker while preserving the original linguistic content and prosodic patterns. In this paper, we propose a novel voice conversion algorithm, Multi-Dimensional Perception Flow Matching (MPFM-VC). Unlike traditional approaches that directly generate waveform outputs, MPFM-VC models the evolutionary trajectory of mel-spectrograms through a flow matching framework and incorporates a multi-dimensional feature perception network to enhance the stability and quality of speech synthesis. Additionally, we introduce a content perturbation method during training to improve the model’s generalization ability and reduce inference-time artifacts. To further increase speaker similarity, we apply adversarial training to the speaker embeddings, which disentangles content representations from speaker identity representations and thereby improves the timbre consistency of the converted speech. Experimental results on both speech and singing voice conversion tasks demonstrate that MPFM-VC outperforms existing state-of-the-art VC models on both subjective and objective evaluation metrics. The synthesized speech exhibits significantly improved naturalness, clarity, and timbre fidelity, validating the effectiveness of the proposed approach.
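To make the phrase "evolutionary trajectory of mel-spectrograms" concrete, the following is a minimal sketch of flow matching in its common straight-line form: a sample moves from noise to a target along a linear path, and a model is trained to predict the velocity of that path. The abstract does not specify the exact formulation used in MPFM-VC, so the linear path, the function names, the toy four-value "mel frame", and the oracle velocity field that stands in for a trained network are all illustrative assumptions.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to target x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; it is constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t."""
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


def euler_sample(predict_velocity, x0, steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy data (hypothetical): one flattened "mel frame" and a Gaussian noise start.
random.seed(0)
mel_target = [0.2, -1.0, 0.5, 0.0]
noise = [random.gauss(0.0, 1.0) for _ in mel_target]


def oracle(x, t):
    """Stand-in for a trained network: returns the exact path velocity."""
    return target_velocity(noise, mel_target)


loss = flow_matching_loss(oracle, noise, mel_target, t=0.3)
sample = euler_sample(oracle, noise, steps=10)
```

With the oracle in place of a learned model, the loss is zero and Euler integration lands on the target frame up to floating-point error. In a real system the velocity predictor would be a neural network conditioned on content and speaker features, and a separate vocoder would convert the generated mel-spectrogram to a waveform.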