It’s called the cocktail party problem, and it’s one we have all encountered. You’re trying to converse with a group of people in a crowded, noisy room, but nearby conversations and background noise make it difficult to focus. How do you tune in to the voices you want to hear while tuning out the voices and noise you don’t?
This is a research question the audio team at Reality Labs Research is answering as part of our hearing sciences program. To encourage others in the scientific community to bring their talents to the task, Meta is collaborating with Imperial College London on an international competition called the SPEAR Challenge (Speech Enhancement for Augmented Reality).
Imagine you’re the listener sitting at a table with several other people having a lively discussion in a room with significant background noise. Conversation switches rapidly between speakers. There are interruptions and overlaps. Music, crowd noise, and other sounds add to an already complex listening environment. How well can you follow and understand the person you most want to listen to at any moment? Participants in the SPEAR Challenge will be competing to develop the best machine learning (ML) models and other algorithmic approaches to enhance this process for all individuals.
Teams will begin by downloading a training data set and software tools that provide a baseline solution to the problem using conventional signal processing methods. Researchers will then use their creativity to find better algorithmic solutions. Near the end of the challenge, an evaluation data set will be released. Teams will apply their algorithms to this previously unseen data and submit the enhanced audio. These enhanced signals will be scored by scientists at Imperial College London using a combination of objective measures and human listening tests.
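The challenge toolkit's actual baseline code isn't reproduced here, but one classic "conventional signal processing" technique in this family is spectral subtraction: estimate the noise spectrum, subtract it from each frame of the noisy signal, and resynthesize. The sketch below is a minimal, illustrative Python/NumPy implementation; the function name, parameters, and the assumption that the opening frames are noise-only are ours, not part of the SPEAR baseline.

```python
import numpy as np

def spectral_subtraction(noisy, noise_est_frames=5, frame_len=512, hop=256):
    """Minimal single-channel spectral subtraction (illustrative sketch).

    Estimates the noise magnitude spectrum from the first few frames
    (assumed noise-only), subtracts it from each frame's magnitude, and
    resynthesizes with the noisy phase via overlap-add.
    """
    window = np.hanning(frame_len)
    # Split the signal into overlapping, windowed frames.
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack(
        [noisy[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Noise estimate: average magnitude over the first frames.
    noise_mag = mag[:noise_est_frames].mean(axis=0)
    # Subtract, flooring at a small fraction of the noise estimate to
    # limit the "musical noise" artifacts of plain subtraction.
    clean_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add resynthesis, normalized by the summed squared window.
    out = np.zeros(len(noisy))
    norm = np.zeros(len(noisy))
    for i, frame in enumerate(clean_frames):
        out[i * hop : i * hop + frame_len] += frame * window
        norm[i * hop : i * hop + frame_len] += window ** 2
    return out / np.maximum(norm, 1e-3)
```

Simple methods like this suppress stationary background noise but struggle with the SPEAR scenario: competing talkers are non-stationary and spectrally similar to the target speech, which is why the challenge invites learned, multi-microphone approaches instead.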
Training and testing ML models and other types of algorithms requires large amounts of diverse data that is representative both of the problem to be solved and of the technology’s future users. The data used by teams competing in the SPEAR Challenge will be derived from the EasyCom data set. EasyCom is an augmented reality (AR) data set designed to support algorithm development for effortless conversation in noisy environments. Created in 2021 by a team of Meta researchers in Redmond, Washington, EasyCom was publicly released to facilitate research in AR solutions to the cocktail party problem.
Without the EasyCom data set, events like the SPEAR Challenge might not attract the same level of attention from other scientists in the field. “We wanted to solve difficult problems in human hearing and communication, but there were no publicly available data sets created with sensor-rich AR glasses in noisy environments,” says Research Scientist Jacob Donley, PhD. “Once EasyCom was created, we realized that the broader scientific community would benefit from using it to solve similar challenges as well.”
EasyCom is the world’s first data set with rich audio-visual signals and additional information on how humans communicate in a typical noisy environment such as a restaurant. The data was collected in a room with motion capture cameras, loudspeakers generating restaurant background noise, and participants wearing an AR research device equipped with multiple microphones. Participants introduced themselves with fake names and occupations, ordered from a fake restaurant menu, and solved riddles and played games designed to elicit head movement and other realistic complexities of multispeaker conversation.
“EasyCom is the first data set with extensive annotations from which studies of the egocentric point of view based on a listener’s perception of discrete audio-visual objects in a real-world context can be undertaken,” says Research Lead Thomas Lunner, PhD.
The cocktail party problem has been part of human experience ever since people started talking in groups. It was given its name in 1953 by British cognitive scientist and Imperial College London Professor of Communication Colin Cherry. Almost 70 years later, despite extraordinary technological progress, it remains unsolved. Why is this problem so challenging?
It turns out that the problem of hearing others speak in noisy environments is much bigger than hearing; it involves significant cognitive processing as well. “The role of cognition in speech perception has been underestimated, and this is why the problem has remained unsolved,” Lunner explains. “We need to understand how the listener’s brain makes sense of speech in contexts where people struggle to converse over noise. The role of context is fundamental. What are the objects in the scene? How do we parse them? Where is the listener’s attention focused? A comprehensive, holistic approach is required.”
This approach engages the talents of dozens of people on the audio research team working on augmented reality. Meta also partners with academic institutions like Imperial College London to accelerate critical areas of research and development.
“We have a difficult task at hand and only so much we can do on our own,” says Research Lead Vladimir Tourbabin, PhD. “We want to invite the involvement of broader scientific communities by showing that the problem is important to solve and that it is closer to being solved because we now have real data to support the development of better solutions.”
For more information, including participant registration, visit the SPEAR Challenge website.
Disclaimer: The SPEAR Challenge is a general-purpose research project with the goal of developing ML models intended for use by all individuals in specific listening environments. This research is not intended to aid persons with, or compensate for, hearing impairments.