SLMGAN

In recent years, large-scale pre-trained speech language models (SLMs) have demonstrated remarkable advancements in various generative speech modeling applications, such as text-to-speech synthesis, voice conversion, and speech enhancement. These applications typically involve mapping text or speech inputs to pre-trained SLM representations, from which target speech is decoded. This paper introduces a new approach, SLMGAN, to leverage SLM representations for discriminative tasks within the generative adversarial network (GAN) framework, specifically for voice conversion. Building upon StarGANv2-VC, we add our novel SLM-based WavLM discriminators on top of the mel-based discriminators along with our newly designed SLM feature matching loss function, resulting in an unsupervised zero-shot voice conversion system that does not require text labels during training. Subjective evaluation results show that SLMGAN outperforms existing state-of-the-art zero-shot voice conversion models in terms of naturalness and achieves comparable similarity, highlighting the potential of SLM-based discriminators for related applications.

Zero-Shot Conversion

All of the following audios are converted from an unseen speaker to another unseen speaker during training. For a fair comparison to the baseline models, all audios are downsampled to 16k Hz. The input to VC models was trimmed so the output has a different length from the input.

All utterances are completely unseen during training, and the results are uncurated (NOT cherry-picked) unless otherwise specified.

For more audio samples, please go to our survey used for MOS evaluation here. You may have to randomly select some answers before proceeding to the next page.

Sample 1 and 2