Scaling ASR Improves Zero and Few Shot Learning

Interspeech

Abstract

With 4.5 million hours of English speech from 10 different sources across 120 countries and models of up to 10 billion parameters, we explore the frontiers of scale for automatic speech recognition. We propose empirical data selection techniques to efficiently find the most valuable samples in pseudo-labeled datasets. To scale model sizes efficiently, we leverage optimizations such as sparse transducer loss and parameter sharding. By training universal English ASR models with up to 10B parameters, we push the limits of speech recognition performance across many domains. Furthermore, our models generalize well to novel domains and styles of speech, exceeding previous results across multiple in-house and public benchmarks. On AphasiaBank, a dataset from speakers with disorders due to brain damage, our best zero-shot and few-shot models achieve 22% and 60% relative improvement, respectively. The same universal model reaches equivalent performance with 500x less in-domain data on the SPGISpeech financial-domain dataset.
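
To make the "relative improvement" figures concrete, the sketch below shows how a relative reduction in an error metric is commonly computed. It assumes the metric is word error rate (WER), which the abstract does not state explicitly. The function name and the example numbers are hypothetical, chosen only to illustrate the arithmetic, and are not results from the paper.

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the relative error reduction as a fraction of the baseline.

    Assumption: both arguments are word error rates on the same test set,
    expressed in the same units (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values, NOT taken from the paper.
baseline = 50.0   # baseline model WER in percent
improved = 39.0   # new model WER in percent

print(f"{relative_improvement(baseline, improved):.0%}")  # prints "22%"
```

Under this definition, a 22% relative improvement means the new error rate is 78% of the baseline error rate, not that the absolute error rate dropped by 22 points.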
