bAbI

The bAbI project

This page gathers resources related to the bAbI project of Facebook AI Research, which is organized towards the goal of automatic text understanding and reasoning. The datasets we have released consist of:

  • The (20) QA bAbI tasks
  • The (6) dialog bAbI tasks
  • The Children’s Book Test
  • The Movie Dialog dataset
  • The WikiMovies dataset
  • The Dialog-based Language Learning dataset
  • The SimpleQuestions dataset
  • HITL Dialogue Simulator

The bAbI tasks

This section presents the first set of 20 tasks for testing text understanding and reasoning in the bAbI project. The tasks are described in detail in the paper:

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin and Tomas Mikolov. Towards AI Complete Question Answering: A Set of Prerequisite Toy Tasks, arXiv:1502.05698.

Please also see the following slides:

Antoine Bordes. Artificial Tasks for Artificial Intelligence, ICLR keynote, 2015.

The aim is that each task tests a unique aspect of text and reasoning, and hence that together the tasks test different capabilities of learning models. More tasks are planned in the future to capture further aspects.

Training Set Size: For each task, there are 1000 questions for training and 1000 for testing. However, we emphasize that the goal is to use as little data as possible to do well on the tasks (i.e., if you can use fewer than 1000 examples, that is even better), and to do so without resorting to task-specific engineering tricks that will not generalize to other tasks, as such tricks are unlikely to be of much use subsequently. Note that the aim during evaluation is to use the _same_ learner across all tasks to evaluate its skills and capabilities.

Supervision Signal: Further, while the MemNN results in the paper use full supervision (including the supporting facts), weakly supervised results would ultimately be preferable, as this kind of data is easier to collect. Hence results of that form are very welcome; for example, this paper does include weakly supervised results.

For the reasons above there are currently several directories:

  • 1) en/ — the tasks in English, readable by humans.
  • 2) hn/ — the tasks in Hindi, readable by humans.
  • 3) shuffled/ — the same tasks with shuffled letters so they are not readable by humans, and so that existing parsers and taggers cannot be used in a straightforward fashion to leverage extra resources; in this case the learner is forced to rely on the given training data. This mimics a learner being presented with a language for the first time and having to learn it from scratch.
  • 4) en-10k/, shuffled-10k/ and hn-10k/ — the same tasks in the three formats, but with 10,000 training examples rather than 1000. Note that the results in the paper use 1000 training examples.

The file format for each task is as follows:

ID text
ID text
ID text
ID question[tab]answer[tab]supporting fact IDS.
...

The IDs for a given “story” start at 1 and increase. When the IDs in a file reset back to 1, you can consider the following sentences as a new “story”. Supporting fact IDs only ever reference the sentences within a “story”.

For Example:

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?	bathroom	1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?	hallway	4
7 John moved to the office.
8 Sandra journeyed to the bathroom.
9 Where is Daniel?	hallway	4
10 Mary moved to the hallway.
11 Daniel travelled to the office.
12 Where is Daniel?	office	11
13 John went back to the garden.
14 John moved to the bedroom.
15 Where is Sandra?	bathroom	8
1 Sandra travelled to the office.
2 Sandra went to the bathroom.
3 Where is Sandra?	bathroom	2
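
For illustration, here is a minimal Python sketch of a reader for this format; the file name and function name are illustrative assumptions, not part of the release:

# Minimal sketch of a bAbI QA task reader, assuming the layout shown above:
# statement lines are "ID text", question lines are
# "ID question[tab]answer[tab]supporting fact IDs", and an ID that resets
# to 1 starts a new story.
def parse_babi(path):
    examples, story = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            sent_id, text = line.split(" ", 1)
            if int(sent_id) == 1:
                story = []                    # IDs reset to 1: a new story begins
            if "\t" in text:                  # question line
                question, answer, supports = text.split("\t")
                examples.append({
                    "context": list(story),
                    "question": question,
                    "answer": answer,
                    "supporting_facts": [int(i) for i in supports.split()],
                })
            else:                             # statement line: add to the current story
                story.append((int(sent_id), text))
    return examples

# e.g. examples = parse_babi("en/qa1_single-supporting-fact_train.txt")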

Versions: Some small updates have been made since the original release (see the README in the data download for more details). You can also get v1.0 and v1.1 here.

Data Stats: Some data statistics, including the overlap between train and test (which is minimal), can be found here.

Code: Code to generate the tasks is available on github. We hope this will encourage the machine learning community to work on, and develop more of, these tasks.

The (6) dialog bAbI tasks

This section presents the set of 6 tasks for testing end-to-end dialog systems in the restaurant domain described in the paper:

Antoine Bordes, Y-Lan Boureau, Jason Weston. Learning End-to-End Goal-Oriented Dialog, arXiv:1605.07683.

Each task tests a unique aspect of dialog. Tasks are designed to complement the set of 20 bAbI tasks for story understanding of the previous section.

For each task, there are 1000 dialogs for training, 1000 for development and 1000 for testing. For tasks 1-5, we also include a second test set (with suffix -OOV.txt) that contains dialogs with entities not present in the training and development sets.

The file format for each task is as follows:

ID user_utterance [tab] bot_utterance ... 

The IDs for a given dialog start at 1 and increase. When the IDs in a file reset back to 1, you can consider the following sentences as a new dialog. When the bot speaks twice in a row, we use the special token “<SILENCE>” to fill in for the missing user utterance. See more details in the README included with the dataset. The goal of the tasks is to predict the bot utterances, which can be sentences or API calls (sentences starting with the special token “api_call”). Here is an example dialog (from Task 1):

1 hi	hello what can i help you with today
2 can you make a restaurant reservation with italian cuisine for six people in a cheap price range	i'm on it
3 <SILENCE>	where should it be
4 rome please	ok let me look into some options for you
5 <SILENCE>	api_call italian rome six cheap
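
As a rough sketch (function name is illustrative), a reader for this format could split each line on the tab and flag API calls:

# Sketch of a dialog bAbI task reader, assuming every line is "ID user[tab]bot",
# that IDs reset to 1 at the start of a new dialog, and that bot replies
# beginning with "api_call" are API calls rather than sentences.
def read_dialogs(path):
    dialogs, turns = [], []
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            turn_id, rest = line.split(" ", 1)
            if int(turn_id) == 1 and turns:   # ID reset: previous dialog ended
                dialogs.append(turns)
                turns = []
            user, bot = rest.split("\t")
            turns.append({"user": user, "bot": bot,
                          "is_api_call": bot.startswith("api_call")})
    if turns:
        dialogs.append(turns)
    return dialogs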

The Children’s Book Test

This section presents the Children’s Book Test (CBT), designed to measure directly how well language models can exploit wider linguistic context. The CBT is built from books that are freely available thanks to Project Gutenberg. Details and baseline results on this dataset can be found in the paper:

Felix Hill, Antoine Bordes, Sumit Chopra and Jason Weston. The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations, arXiv:1511.02301.

After allocating books to either the training, validation or test sets, we formed example ‘questions’ from chapters in the books by enumerating 21 consecutive sentences. In each question, the first 20 sentences form the context, and a word is removed from the 21st sentence, which becomes the query. Models must identify the answer word among a selection of 10 candidate answers appearing in the context sentences and the query. For finer-grained analyses, we evaluated four classes of question by removing distinct types of word: Named Entities, (Common) Nouns, Verbs and Prepositions.

Here is an example question (context + query) from Alice in Wonderland by Lewis Carroll:

Context:
1 So they had to fall a long way .
2 So they got their tails fast in their mouths .
3 So they could n't get them out again .
4 That 's all .
5 `` Thank you , " said Alice , `` it 's very interesting .
6 I never knew so much about a whiting before . "
7 `` I can tell you more than that , if you like , " said the Gryphon .
8 `` Do you know why it 's called a whiting ? "
9 `` I never thought about it , " said Alice .
10 `` Why ? "
11 `` IT DOES THE BOOTS AND SHOES . '
12 the Gryphon replied very solemnly .
13 Alice was thoroughly puzzled .
14 `` Does the boots and shoes ! "
15 she repeated in a wondering tone .
16 `` Why , what are YOUR shoes done with ? "
17 said the Gryphon .
18 `` I mean , what makes them so shiny ? "
19 Alice looked down at them , and considered a little before she gave her answer .
20 `` They 're done with blacking , I believe . "
Query: `` Boots and shoes under the sea , " the XXXXX went on in a deep voice , `` are done with a whiting ".
Candidates: Alice|BOOTS|Gryphon|SHOES|answer|fall|mouths|tone|way|whiting
Answer: gryphon
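
As a rough illustration of the construction described above, the sketch below builds one such cloze question from 21 consecutive sentences; the candidate-selection step is a simplification (the real CBT draws candidates of the same word type as the answer), and all names are illustrative:

import random

# Sketch: the first 20 sentences form the context, the answer word is replaced
# by XXXXX in the 21st sentence to form the query, and 10 candidates are drawn
# from words appearing in the passage.
def make_cbt_question(sentences, answer_word, n_candidates=10):
    assert len(sentences) == 21 and answer_word in sentences[20].split()
    context, last = sentences[:20], sentences[20]
    query = " ".join("XXXXX" if w == answer_word else w for w in last.split())
    pool = sorted({w for s in context for w in s.split()} - {answer_word})
    candidates = random.sample(pool, n_candidates - 1) + [answer_word]
    return {"context": context, "query": query,
            "candidates": sorted(candidates), "answer": answer_word}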

The Movie Dialog dataset

This section presents the Movie Dialog dataset (MDD), designed to measure how well models can perform at goal- and non-goal-oriented dialog centered around the topic of movies (question answering, recommendation and discussion). Details and baseline results on this dataset can be found in the paper:

Jesse Dodge, Andreea Gane, Xiang Zhang, Antoine Bordes, Sumit Chopra, Alexander Miller, Arthur Szlam, Jason Weston. Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems, arXiv:1511.06931.

The file format is again the same as in the bAbI tasks. The IDs for a given dialog start at 1 and increase. Each ID corresponds to one turn from each speaker (an “exchange”), with the two turns tab-separated. When the IDs in a file reset back to 1, you can consider the following sentences as a new conversation.

For Example:

1 Scarface, The Kite Runner, The Shining, Eternal Sunshine of the Spotless Mind, Avatar, Requiem for a Dream, and Lolita are movies I really like. I'm looking for a Drama movie.	Dogville
2 Who does that star?	Nicole Kidman, Lauren Bacall
3 I like Ray Milland movies more. Do you know anything else?	The Thief

The WikiMovies dataset

This includes only the QA part of the Movie Dialog dataset, but with three different knowledge settings: a traditional knowledge base (KB), Wikipedia as the source of knowledge, or IE (information extraction) over Wikipedia. This makes it possible to test the ability of models to read documents directly to answer questions, and to compare this to traditional KBs in the same setting. See the paper for more details:

A. H. Miller, A. Fisch, J. Dodge, A. Karimi, A. Bordes, J. Weston. Key-Value Memory Networks for Directly Reading Documents, arXiv:1606.03126.

The Dialog-based Language Learning dataset

This section presents the Dialog-based Language Learning dataset, designed to measure how well models can learn as a student given a teacher’s textual responses to the student’s answers (as well as, potentially, an external real-valued reward signal). Details and baseline results on this dataset can be found in the paper:

Jason Weston. Dialog-based Language Learning, arXiv:1604.06045.

Here is an example dialog; the last number on each line (0 or 1) is the external reward:

1 Mary moved to the bathroom. 0
2 John went to the hallway. 0
3 Where is Mary? bathroom 0
4 That's right. 1
5 Daniel went back to the hallway. 0
6 Sandra moved to the garden. 0
7 Where is Daniel? office 0
8 No, they are downstairs. 0
9 John moved to the office. 0
10 Sandra journeyed to the bathroom. 0
11 Where is Daniel? office 0
12 No, they are downstairs. 0
13 Mary moved to the hallway. 0
14 Daniel travelled to the office. 0
15 Where is Daniel? office 0
16 Correct! 1
17 John went back to the garden. 0
18 John moved to the bedroom. 0
19 Where is Sandra? garden 0
20 No, they are upstairs. 0
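
As a minimal sketch (names are illustrative, and the layout assumed is the one displayed above), a reader for these lines might peel off the ID at the front and the reward at the end:

# Sketch: each line is "ID text ... reward", where the reward is the trailing
# 0 or 1; question lines additionally carry the learner's answer before it.
def split_reward(line):
    turn_id, rest = line.strip().split(" ", 1)
    text, reward = rest.rsplit(" ", 1)
    return int(turn_id), text, int(reward)

# e.g. split_reward("4 That's right. 1") returns (4, "That's right.", 1)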

The SimpleQuestions dataset

This section presents SimpleQuestions, a dataset of human-generated questions collected for research in automatic question answering. Details and baseline results on this dataset can be found in the paper:

Antoine Bordes, Nicolas Usunier, Sumit Chopra and Jason Weston. Large-scale Simple Question Answering with Memory Networks, arXiv:1506.02075.

The SimpleQuestions dataset consists of a total of 108,442 questions written in natural language by human English-speaking annotators, each paired with a corresponding fact, formatted as (subject, relationship, object), that provides the answer and also a complete explanation. Facts were extracted from the Freebase knowledge base. We randomly shuffle these questions and use 70% of them (75,910) as the training set, 10% (10,845) as the validation set, and the remaining 20% as the test set.

Here are some examples of questions and facts:

* What American cartoonist is the creator of Andy Lippincott?
  Fact: (andy_lippincott, character_created_by, garry_trudeau)
* Which forest is Fires Creek in?
  Fact: (fires_creek, containedby, nantahala_national_forest)
* What does Jimmy Neutron do?
  Fact: (jimmy_neutron, fictional_character_occupation, inventor)
* What dietary restriction is incompatible with kimchi?
  Fact: (kimchi, incompatible_with_dietary_restrictions, veganism)
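
To make the pairing concrete, here is a small Python sketch of how one example could be represented; the structure (and the idea that answering amounts to predicting the right fact) follows the description above, while the type names are illustrative:

from collections import namedtuple

# One SimpleQuestions example: a natural-language question paired with the
# single (subject, relationship, object) fact that answers it.
Fact = namedtuple("Fact", ["subject", "relationship", "object"])
Example = namedtuple("Example", ["question", "fact"])

ex = Example(
    question="What American cartoonist is the creator of Andy Lippincott?",
    fact=Fact("andy_lippincott", "character_created_by", "garry_trudeau"),
)

# A model must identify the subject and relationship from the question; the
# object of the matching fact then gives the answer.
print(ex.fact.object)   # garry_trudeau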

Update (Dec. 15): v2 of the dataset now includes the subset of Freebase used in the paper “Simple Question Answering with Memory Networks”.

HITL Dialogue Simulator

The Human-in-the-Loop (HITL) Dialogue Simulator provides a framework for evaluating a bot’s ability to learn to improve its performance in an online setting using feedback from its dialog partner. The dataset contains questions based on the bAbI and WikiMovies datasets, but now with feedback from the dialog partner. We include both simulated and human dialogs. Dialogs follow the same form as in the Dialog-based Language Learning datasets, but now depend on the model’s predictions via the simulator. Full details on this simulator are available in the following paper:

J. Li, A. H. Miller, S. Chopra, M. Ranzato, J. Weston. Dialogue Learning With Human-In-The-Loop (forthcoming), arXiv:1611.09823.