TTS Skins: Speaker Conversion via ASR

Interspeech

Abstract

We present a fully convolutional wav-to-wav network for converting between speakers’ voices without relying on text. Our network is based on an encoder-decoder architecture, in which the encoder is pre-trained for the task of Automatic Speech Recognition and a multi-speaker waveform decoder is trained to reconstruct the original signal in an autoregressive manner. We train the network on narrated audiobooks and demonstrate multi-voice TTS in those voices by converting the voice of a TTS robot.
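To make the data flow described in the abstract concrete, here is a minimal toy sketch in plain Python. It is not the network from the paper: `FrozenEncoder`, `AutoregressiveDecoder`, `convert`, the hop size, and all weights are hypothetical stand-ins chosen for illustration. It only shows the shape of the idea: a fixed encoder turns a waveform into frame-level features, and a speaker-conditioned decoder generates output samples one at a time, each depending on the previous sample.

```python
import math
import random


class FrozenEncoder:
    """Toy stand-in for an ASR-pre-trained encoder (hypothetical, not the paper's model).

    Maps a waveform to one feature per frame of `hop` samples.
    It has no trainable state, mirroring an encoder that is kept fixed.
    """

    def __init__(self, hop=4):
        self.hop = hop

    def __call__(self, wav):
        # Mean absolute amplitude per non-overlapping frame (a toy feature).
        return [
            sum(abs(s) for s in wav[i:i + self.hop]) / self.hop
            for i in range(0, len(wav) - self.hop + 1, self.hop)
        ]


class AutoregressiveDecoder:
    """Toy multi-speaker decoder: each sample depends on the previous sample,
    the current frame feature, and a per-speaker bias (all assumed for illustration)."""

    def __init__(self, num_speakers, hop=4, seed=0):
        rng = random.Random(seed)
        self.hop = hop
        # One scalar "embedding" per speaker; a real model would learn a vector.
        self.speaker_bias = [rng.uniform(-0.5, 0.5) for _ in range(num_speakers)]
        self.w_prev = 0.5
        self.w_feat = 1.0

    def generate(self, features, speaker_id):
        out = []
        prev = 0.0
        for f in features:
            # Emit `hop` samples per frame so the output length matches the input.
            for _ in range(self.hop):
                prev = math.tanh(
                    self.w_prev * prev + self.w_feat * f + self.speaker_bias[speaker_id]
                )
                out.append(prev)
        return out


def convert(wav, encoder, decoder, target_speaker):
    """Voice conversion as encode, then decode conditioned on the target speaker."""
    return decoder.generate(encoder(wav), target_speaker)


# Usage: the same input decoded for two different target speakers.
wav = [math.sin(2 * math.pi * i / 8) for i in range(16)]
encoder = FrozenEncoder(hop=4)
decoder = AutoregressiveDecoder(num_speakers=2, hop=4, seed=0)
out_a = convert(wav, encoder, decoder, target_speaker=0)
out_b = convert(wav, encoder, decoder, target_speaker=1)
```

In this sketch, switching `target_speaker` changes only the decoder's conditioning, while the encoder output stays the same. That separation is what would allow the output of a single TTS voice to be re-rendered in several voices.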
