vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations

International Conference on Learning Representations (ICLR)

Abstract

We propose vq-wav2vec to learn discrete representations of audio segments through a wav2vec-style self-supervised context prediction task. The algorithm uses either a Gumbel-Softmax or online k-means clustering to quantize the dense representations. Discretization enables the direct application of algorithms from the NLP community which require discrete inputs. Experiments show that BERT pre-training achieves a new state of the art on TIMIT phoneme classification and WSJ speech recognition.

Featured Publications

All Publications

A Method for Animating Children’s Drawings of the Human Figure

Harrison Jesse Smith, Qingyuan Zheng, Yifei Li, Somya Jain, Jessica K. Hodgins

Simulation and Retargeting of Complex Multi-Character Interactions

Yunbo Zhang, Deepak Gopinath, Yuting Ye, Jessica Hodgins, Greg Turk, Jungdam Won

Reasoning over Public and Private Data in Retrieval-Based Systems

Simran Arora, Patrick Lewis, Angela Fan, Jacob Kahn, Christopher Ré

All Publications