Multilingual Text-to-Speech Training Using Cross Language Voice Conversion and Self-supervised Learning of Speech Representations

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Abstract

State-of-the-art text-to-speech (TTS) models can generate high-fidelity monolingual speech, but synthesizing multilingual speech in a single speaker's voice remains challenging. One major hurdle is training data: it is hard to find speakers with native proficiency in several languages. One way of mitigating this issue is to generate a polyglot corpus through voice conversion. In this paper, we train such a multilingual TTS system using a novel cross-lingual voice conversion model trained on speaker-invariant features extracted from a speech representation model pre-trained on 53 languages through self-supervised learning [1]. To further improve the speaker identity shift, we also adopt a speaker similarity loss term during training. We then use this model to convert multilingual multi-speaker speech data into the voice of the target speaker. By augmenting the data with 4 other languages, we train a multilingual TTS system that speaks 5 languages (English, French, German, Italian, and Spanish) in the voice of a native monolingual English speaker. Our system achieves an improved mean opinion score (MOS) compared with the multi-speaker baseline for all languages, specifically: 3.74 vs. 3.62 for Spanish, 3.11 vs. 2.71 for German, 3.47 vs. 2.84 for Italian, and 2.72 vs. 2.41 for French.
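The abstract mentions two key ingredients: speaker-invariant features from a 53-language self-supervised speech model [1], and a speaker similarity loss. The sketch below illustrates how such features and loss could be set up, assuming the 53-language model is wav2vec 2.0 XLSR-53; the model identifier, the external speaker encoder, and the exact loss form are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumptions noted): content-feature extraction with a
# pre-trained 53-language wav2vec 2.0 model, plus a cosine-based speaker
# similarity loss between speaker embeddings.
import torch
import torch.nn.functional as F
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Assumed checkpoint; the paper only states "pre-trained with 53 languages" [1].
MODEL_ID = "facebook/wav2vec2-large-xlsr-53"
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
ssl_model = Wav2Vec2Model.from_pretrained(MODEL_ID).eval()

@torch.no_grad()
def extract_content_features(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return frame-level SSL features, used as largely speaker-invariant
    content input to a voice conversion model."""
    inputs = feature_extractor(
        waveform_16khz.numpy(), sampling_rate=16_000, return_tensors="pt"
    )
    return ssl_model(inputs.input_values).last_hidden_state  # (1, T, 1024)

def speaker_similarity_loss(
    converted_emb: torch.Tensor, target_emb: torch.Tensor
) -> torch.Tensor:
    """Penalize dissimilarity between speaker embeddings of the converted
    speech and the target speaker. Embeddings are assumed to come from any
    pre-trained speaker encoder (e.g. a d-vector model); one plausible
    choice of loss is 1 - cosine similarity."""
    return 1.0 - F.cosine_similarity(converted_emb, target_emb, dim=-1).mean()
```

In training, such a term would typically be added to the conversion model's reconstruction objective with a tunable weight, pushing converted speech toward the target speaker's identity.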
