MPFM-VC: A Voice Conversion Algorithm Based on Multi-Dimensional Perception Flow Matching

Yanze Wang
Xuming Han
Shuai Lv
Ting Zhou
Yali Chu

Abstract

Voice conversion (VC) transforms raw speech into high-quality audio that resembles the target speaker’s voice while preserving the original linguistic content and prosodic patterns. In this study, we propose a voice conversion algorithm, Multi-Dimensional Perception Flow Matching (MPFM-VC). Unlike traditional approaches that directly generate waveform outputs, MPFM-VC models the evolutionary trajectory of mel spectrograms with a flow-matching framework and incorporates a multi-dimensional feature perception network to enhance the stability and quality of speech synthesis. Additionally, we introduce a content perturbation method during training to improve the model’s generalization ability and reduce inference-time artifacts. To further increase speaker similarity, adversarial training on speaker embeddings is employed to disentangle content from speaker identity representations, thereby enhancing the timbre consistency of the converted speech. Experimental results on both speech and singing voice conversion tasks show that MPFM-VC achieves competitive performance compared to existing state-of-the-art VC models on both subjective and objective evaluation metrics. The synthesized speech shows improved naturalness, clarity, and timbre fidelity, suggesting the potential effectiveness of the proposed approach.
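
The abstract names flow matching over mel spectrograms but gives no equations, so the following is only a minimal sketch of the generic technique, not the authors' model. It assumes the common straight-line probability path between Gaussian noise and a target mel frame, where the regression target is the constant velocity x1 - x0. The function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`) and the `predict_velocity` callback are hypothetical placeholders; MPFM-VC's perception network, conditioning, content perturbation, and adversarial speaker loss are not represented.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight-line path from noise x0 (t=0) to target x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight-line path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, rng):
    """Single-sample estimate of the flow-matching regression loss for one
    target frame x1 (a list of floats standing in for a mel-spectrogram frame).
    predict_velocity(x_t, t) is a hypothetical stand-in for the learned network."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # noise sample
    t = rng.random()                         # time drawn uniformly from [0, 1)
    xt = interpolate(x0, x1, t)
    v_pred = predict_velocity(xt, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def euler_sample(predict_velocity, x0, steps=10):
    """Generate a frame by integrating the predicted velocity field from t=0 to t=1
    with fixed-step Euler integration."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        v = predict_velocity(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In practice the frame would be a full mel spectrogram tensor and `predict_velocity` a neural network conditioned on content and speaker features; plain lists are used here only to keep the sketch dependency-free.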

Version published to 10.3390/app15105503
May 14, 2025
Version published to 10.20944/preprints202504.1428.v2
Apr 18, 2025
Version published to 10.20944/preprints202504.1428.v1
Apr 17, 2025

Related articles

Fake Voice Detection: A Comparative Analysis of Complex-Valued Deep Learning and Transformer Models across Multiple Languages

This article has 5 authors:
1. Mario Jojoa
2. Alfonso Bahillo
3. Dávid Sztahó
4. Giovanni Hernandez
5. Géza Nemeth
This article has no evaluations. Latest version: Feb 3, 2026

Self-Supervised Audio Representation Learning Model Based on Time-Frequency Decoupling and Masked Reconstruction

This article has 3 authors:
1. Jie Xu
2. Yuhao Dai
3. Zhifeng Wang
This article has no evaluations. Latest version: Dec 31, 2025

Environmental Sound Classification Using Feature Fusion of MFCCs, Mel-spectrogram, and Chroma

This article has 1 author:
1. Mainul Islam
This article has no evaluations. Latest version: Jan 16, 2026