Natural Language Processing & Speech Archives

PUBLICATIONS

Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

This paper presents Voicebox, the most versatile text-conditioned speech generative model at scale. Voicebox is trained on a text-guided speech infilling task, where the goal is to generate masked speech given its surrounding audio and text transcript.

PLUE: Language Understanding Evaluation Benchmark for Privacy Policies in English

To address these problems and encourage research to develop NLU technologies in the privacy policy domain, we introduce the Privacy Policy Language Understanding Evaluation (PLUE) benchmark, to evaluate the privacy policy language understanding across six tasks, including text classification, question answering, semantic parsing, and named-entity recognition.

Learning ASR Pathways: A Sparse Multilingual ASR Model

Neural network pruning compresses automatic speech recognition (ASR) models effectively. However, in multilingual ASR, language ...

GCT: Gated Contextual Transformer For Sequential Audio Tagging

We propose a new neural network architecture for the task of sequential audio tagging. "Sequential audio tagging" means we want to know what types of acoustic events (e.g. dog bark, car engine) occur in an audio recording, and in what order they occur.

Leveraging Heteroscedastic Uncertainty in Learning Complex Spectral Mapping for Single-channel Speech Enhancement

To mitigate undersampling, our approach inflates the uncertainty lower bound and weights each loss component with their uncertainty, effectively compensating severely...

Massively Multilingual ASR on 70 Languages: Tokenization, Architecture, and Generalization Capabilities

End-to-end multilingual ASR has become more appealing because of several reasons such as simplifying the training and deployment process and positive performance ...

Cocktail HuBERT: Generalized Self-Supervised Pre-Training for Mixture and Single-Source Speech

This paper presents Cocktail HuBERT, a self-supervised learning framework that generalizes to mixture speech using a masked pseudo source separation objective.

SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision

Visual speech recognition (VSR), also known as lip reading, is the task of recognizing speech content based on visual lip movements. VSR has a wide range of applications in real-world scenarios such as helping the hearing-impaired perceive human speech and improving automatic speech recognition (ASR) in noisy environments.

LA-VocE: Low-SNR Audio-visual Speech Enhancement Using Neural Vocoders

In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audiovisual speech via a transformer-based architecture, and then converts...

Scaling Speech Technology to 1,000+ Languages

We build a new dataset comprising a moderate amount of labeled data for 1,107 languages and another dataset of unlabeled speech in 3,809 languages (§3). We leverage ...