MPFM-VC: A Voice Conversion Algorithm based on Multi-Dimensional Perception Flow Matching

Yanze Wang
Xuming Han
Shuai Lv
Ting Zhou
Yali Chu

Abstract

Voice conversion (VC) is a cutting-edge technology that enables the transformation of raw speech into high-quality audio resembling a target speaker, while preserving the original linguistic content and prosodic patterns. In this paper, we propose a novel voice conversion algorithm, Multi-Dimensional Perception Flow Matching (MPFM-VC). Unlike traditional approaches that directly generate waveform outputs, MPFM-VC models the evolutionary trajectory of mel-spectrograms through a flow matching framework, and incorporates a multi-dimensional feature perception network to enhance the stability and quality of speech synthesis. Additionally, we introduce a content perturbation method during training to improve the model’s generalization ability and reduce inference-time artifacts. To further increase speaker similarity, an adversarial training mechanism on speaker embeddings is employed to achieve effective disentanglement between content and speaker identity representations, thereby enhancing the timbre consistency of the converted speech. Experimental results on both speech and singing voice conversion tasks demonstrate that MPFM-VC outperforms existing state-of-the-art VC models on both subjective and objective evaluation metrics. The synthesized speech exhibits significantly improved naturalness, clarity, and timbre fidelity, validating the effectiveness of the proposed approach.
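
For readers unfamiliar with flow matching, the following is a minimal sketch of the generic straight-line (rectified-flow style) formulation applied to arrays shaped like mel-spectrograms. It is not the MPFM-VC implementation: the abstract does not specify the probability path, the network, or the conditioning, so the linear path, the array shapes (80 mel bins, 50 frames), and the names `interpolate`, `flow_matching_loss`, and `euler_sample` are all assumptions made for illustration. An oracle velocity stands in for the learned network.

```python
import numpy as np

def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to target x1 at time t.

    Assumed path: x_t = (1 - t) * x0 + t * x1, whose velocity is x1 - x0.
    """
    t = np.asarray(t, dtype=x0.dtype).reshape(-1, 1, 1)  # broadcast over (batch, mels, frames)
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return x_t, target_velocity

def flow_matching_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocity fields."""
    return float(np.mean((predicted_velocity - target_velocity) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt, dtype=x.dtype)
        x = x + dt * velocity_fn(x, t)
    return x

rng = np.random.default_rng(0)
x1 = rng.standard_normal((2, 80, 50)).astype(np.float32)  # stand-in for target mel-spectrograms
x0 = rng.standard_normal((2, 80, 50)).astype(np.float32)  # Gaussian noise starting point
t = rng.uniform(size=2).astype(np.float32)                # one random time per batch item

x_t, u = interpolate(x0, x1, t)

# Hypothetical oracle in place of a trained network: it returns the true velocity,
# so Euler integration should carry the noise x0 onto the target x1.
oracle = lambda x, t: x1 - x0
x_hat = euler_sample(oracle, x0, num_steps=10)
```

In a trained system the oracle would be replaced by a neural network that predicts the velocity from `x_t`, `t`, and conditioning such as content and speaker features; training would minimize `flow_matching_loss` between that prediction and `u`.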

Version published to 10.20944/preprints202504.1428.v2
Apr 18, 2025
Version published to 10.20944/preprints202504.1428.v1
Apr 17, 2025

