In February, Meta launched the Dynabench Data Collection and Benchmarking Platform request for proposals (RFP). Our intent for this RFP was to support academics using Dynabench to run tasks and experiments on the theme of Rethinking Benchmarking. Today, we’re announcing the winners of this award.
Dynabench operates from the assumption that humans are uniquely able to interrogate model weaknesses, and thus the majority of its current work has relied on collaboration between humans and models in the loop: Humans identify model weaknesses so that we can improve upon them.
Researchers using Dynabench generally perform the data generation process over multiple rounds, collecting ever more challenging data that can be used to train demonstrably stronger models over time. Such data can also be used to test how well models perform on difficult AI tasks. Last September, Meta opened up the Dynabench platform for anyone interested in human-and-model-in-the-loop data collection to run their own task.
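To make that workflow concrete, here is a minimal sketch of the multi-round, human-and-model-in-the-loop process described above. It is purely illustrative: the callables it takes (train_model, humans_write_examples, evaluate) are hypothetical placeholders supplied by the reader, not part of the Dynabench API.

```python
# Illustrative sketch of multi-round dynamic adversarial data collection.
# The callables passed in (train_model, humans_write_examples, evaluate)
# are hypothetical placeholders, not Dynabench APIs.

def dynamic_benchmark(seed_data, train_model, humans_write_examples,
                      evaluate, num_rounds=3):
    """Collect increasingly challenging examples against an improving model."""
    training_data = list(seed_data)
    model = train_model(training_data)  # round-0 target model

    for round_id in range(1, num_rounds + 1):
        # Annotators write examples intended to fool the current model;
        # only validated model-fooling examples are kept for this round.
        new_examples = humans_write_examples(model)

        # The round's adversarial examples serve as a test set for the
        # current model and as training data for the next, stronger one.
        print(f"round {round_id}: accuracy {evaluate(model, new_examples):.3f}")
        training_data.extend(new_examples)
        model = train_model(training_data)

    return model, training_data
```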
To expand access to the platform, Dynabench recently joined MLCommons, an open engineering consortium with a mission to benefit society by accelerating innovation in machine learning. In collaboration with its 50+ founding partners, including Meta, MLCommons builds tools for the machine learning industry by creating benchmarks and public datasets and by developing metrics and best practices.
With the goal of ushering in the next generation of AI models, the Dynabench RFP on Rethinking Benchmarking attracted 26 proposals from 21 universities and institutions around the world. Thank you to everyone who took the time to submit a proposal, and congratulations to the winners!
Principal investigators are listed first unless otherwise noted.
A leaderboard and competition for human-computer adversarial QA
Jordan Boyd-Graber and Yoo Yeon Sung (University of Maryland College Park)
We propose to create challenging human-in-the-loop examples for question answering tasks by: A) defining a metric to encourage human participation in a question answering writing task, B) developing tools and visualizations to encourage the authoring of diverse questions, and C) creating a new dataset that allows for effective use of human vs. computer question answering.
Creating adversarial examples for retrieval systems
Danqi Chen, Zexuan Zhong, Alexander Wettig (Princeton University)
We propose to build a new dynamic benchmark for retrieval. Our hope is that (1) this benchmark can provide us with a holistic view of retrieval systems: existing works tackle only certain aspects (e.g., rare entities, train-test overlap) based on simple intuitions, whereas through a human-in-the-loop component we expect human annotators to be able to “attack” retrieval systems in many novel ways; and (2) we can add the examples back to the training loops of retrieval systems and hence further improve their performance and robustness.
Towards massively multilingual visually grounded reasoning data
Desmond Elliott (University of Copenhagen)
We propose to collect data for massively multilingual visually grounded reasoning. This proposal will substantially expand the existing MaRVL dataset (Liu & Bugliarello et al., 2021), which contains images annotated with sentences in typologically diverse languages. The data will be useful for training and evaluating computational models, as well as corpus studies across languages.
Adversarial NERDs: Optimizing feedback between humans-and-model-in-the-loop
Scott A. Hale, Hannah Rose Kirk, Katerina Margatina (University of Oxford, University of Sheffield)
We present the benefits of improving the feedback between humans and models in the loop to create adversarial, novel, efficient, and realistic datasets (Adversarial NERDs). We borrow concepts from active learning strategies, including diversity, uncertainty, and representativeness sampling, to help human annotators generate adversarial data that targets the most informative parts of the model’s decision space. Through an experimental approach, we will test the benefits of feedback on novelty, efficiency, and realism, against a baseline condition with no feedback, over multiple rounds of data collection for two natural language processing (NLP) tasks. With better communication between humans and AI, we aim to improve the cost-benefit ratio of examples collected on Dynabench. Ultimately, the metrics of dataset health that we propose could be incorporated into the Dynabench interface via a leaderboard for dataset quality.
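As a rough illustration of one of the sampling strategies mentioned above, the sketch below ranks candidate examples by predictive entropy (uncertainty sampling) so annotators can be pointed at the regions where the model is least certain. The function names and the assumption that softmax class probabilities are available are ours, not the proposal's.

```python
import numpy as np

def predictive_entropy(probs: np.ndarray) -> np.ndarray:
    """Entropy of each row of class probabilities; higher means the model
    is less certain about that example.

    probs: shape (n_examples, n_classes), rows sum to 1 (softmax outputs).
    """
    eps = 1e-12  # avoid log(0)
    return -(probs * np.log(probs + eps)).sum(axis=1)

def select_uncertain_prompts(candidate_texts, probs, k=10):
    """Return the k candidates the model is least certain about, as seeds
    for annotators targeting informative parts of the decision space."""
    ranked = np.argsort(-predictive_entropy(probs))[:k]
    return [candidate_texts[i] for i in ranked]
```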
ContExTox: Context-aware and explainable toxicity detection
Maarten Sap and Xuhui Zhou (Carnegie Mellon University)
We propose to collect ContExTox, an adversarial dataset of 60,000 statements paired with social and conversational contexts and toxicity explanations, and ContExTox-Decoder, a context-aware model for detecting and explaining toxicity, trained on our dataset. We will use the Dynabench framework to improve ContExTox-Decoder’s robustness, by iteratively enhancing ContExTox with adversarial context-statement pairs that fool the current model. Specifically, we will ask workers to generate adversarial conversational contexts (e.g., preceding utterances) and social contexts (e.g., demographic identities, social roles) for statements, such that the pragmatic interpretations of the statements' toxicity are altered.
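As a loose sketch of what such adversarial context collection could look like programmatically, a candidate context counts as adversarial if adding it flips the model's toxicity decision for a statement. The interface below is our assumption for illustration only, not the proposal's actual model or API.

```python
def adversarial_contexts(statement, candidate_contexts, toxicity_score,
                         threshold=0.5):
    """Keep candidate contexts that flip the toxicity decision for a statement.

    toxicity_score(statement, context) is assumed to return P(toxic) in [0, 1];
    this interface is a hypothetical stand-in, not the proposal's actual model.
    """
    toxic_without_context = toxicity_score(statement, None) >= threshold
    flips = []
    for context in candidate_contexts:
        toxic_with_context = toxicity_score(statement, context) >= threshold
        if toxic_with_context != toxic_without_context:
            # The added context changed the pragmatic interpretation.
            flips.append(context)
    return flips
```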
We’d also like to thank our Meta colleagues for supporting proposal selection: Melissa Hall, Dieuwke Hupkes, Patrick Lewis, Pedro Rodriguez, and Candace Ross.