Natural speech re-synthesis from direct cortical recordings using a pre-trained encoder-decoder framework
Abstract
Reconstructing perceived speech stimuli from neural recordings not only advances our understanding of the neural coding underlying speech processing but also serves as an important building block for brain-computer interfaces and neuroprosthetics. However, previous attempts to re-synthesize speech directly from neural decoding have suffered from low re-synthesis quality. Given the limited neural data and the complex speech representation space, it is difficult to build a decoding model that directly maps neural signals to high-fidelity speech. In this work, we propose a pre-trained encoder-decoder framework to address these problems. We recorded high-density electrocorticography (ECoG) signals while participants listened to natural speech. We built a pre-trained speech re-synthesis network consisting of a context-dependent speech encoding network and a generative adversarial network (GAN) for high-fidelity speech synthesis. This model was pre-trained on a large naturalistic speech corpus and can extract features critical for speech re-synthesis. We then built a lightweight neural decoding network that maps the ECoG signals into the latent space of the pre-trained network, and used the GAN decoder to synthesize natural speech. Using only 20 minutes of intracranial neural data, our neural-driven speech re-synthesis model demonstrated promising performance, with a phoneme error rate (PER) of 28.6%, and human listeners recognized 71.6% of the words in the re-synthesized speech. This work demonstrates the feasibility of using pre-trained self-supervised models and feature alignment to build efficient neural-to-speech decoding models.
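To make the feature-alignment idea concrete, the sketch below shows, in PyTorch, how a lightweight ECoG decoder can be trained to match the latents of a frozen pre-trained speech encoder, with a frozen GAN decoder used only at synthesis time. This is a minimal illustration under assumed shapes and module choices; all names (ECoGToLatent, train_step, the stand-in encoder) and dimensions are hypothetical, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ECoGToLatent(nn.Module):
    """Lightweight decoder mapping ECoG features into the latent space
    of a pre-trained speech encoder (feature alignment)."""
    def __init__(self, n_electrodes: int, latent_dim: int, hidden: int = 256):
        super().__init__()
        # Temporal convolution pools local neural dynamics; a small
        # bidirectional GRU adds context across the utterance.
        self.conv = nn.Conv1d(n_electrodes, hidden, kernel_size=5, padding=2)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, latent_dim)

    def forward(self, ecog: torch.Tensor) -> torch.Tensor:
        # ecog: (batch, time, electrodes) -> latents: (batch, time, latent_dim)
        h = self.conv(ecog.transpose(1, 2)).transpose(1, 2)
        h, _ = self.rnn(h)
        return self.proj(h)

def train_step(ecog_decoder, speech_encoder, ecog, audio_feats, optimizer):
    """One alignment step: only the ECoG decoder is updated; the
    pre-trained speech encoder provides frozen target latents."""
    with torch.no_grad():
        target_latents = speech_encoder(audio_feats)
    pred_latents = ecog_decoder(ecog)
    loss = nn.functional.mse_loss(pred_latents, target_latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    # Toy shapes: 128 electrodes, 100 frames, 80-d mel features, 512-d latents.
    ecog_decoder = ECoGToLatent(n_electrodes=128, latent_dim=512)
    speech_encoder = nn.Linear(80, 512)  # stand-in for the pre-trained encoder
    optimizer = torch.optim.Adam(ecog_decoder.parameters(), lr=1e-3)
    ecog = torch.randn(2, 100, 128)
    mels = torch.randn(2, 100, 80)
    print(train_step(ecog_decoder, speech_encoder, ecog, mels, optimizer))
    # At inference, a frozen GAN decoder would synthesize the waveform:
    # waveform = gan_decoder(ecog_decoder(ecog))
```

Keeping the pre-trained encoder and GAN decoder frozen is what lets the neural decoding network stay small enough to train on roughly 20 minutes of intracranial data, as described in the abstract.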