November 1, 2017

Visual reasoning and dialog: Towards natural language conversations about visual data

By: Dhruv Batra, Devi Parikh

The broad objective of visual dialog research is to teach machines to have natural language conversations with humans about visual content. This emerging field brings together aspects of computer vision, natural language processing, and dialog systems research.

In general, dialog systems can have a spectrum of capabilities. On one end of the spectrum are task-driven chat bots that you talk to for a specific goal, e.g., to book a flight. On the other end are chitchat bots that you can talk to about any topic but without a clear goal in mind. Visual dialog lies somewhere between the two extremes: it is free-form dialog, but the conversation is grounded in the content of a specific image.

Future application: An intelligent agent uses its vision capabilities and natural language interface to assist a person.

While visual dialog research is in the early phases, there are many potential future use cases for this technology. For example, being able to ask a series of questions could help visually-impaired people understand images that are posted online or taken of their surroundings, or allow medical personnel to better interpret medical scans. It could also have uses in AR/VR applications where a user could chat in natural language and work with a virtual companion who is seeing what they are seeing based on a visual common ground.

Future application: A virtual companion that sees what the user sees, based on a visual common ground.

Building these kinds of systems poses many basic research challenges. We have recently pursued two research directions: (1) explicit reasoning about visual content and (2) human-like visual dialog.

Explicit reasoning about visual content

One central language interface to visual data is asking a natural language question, such as: “What animal is in the image?” or “How many people are sitting on the bench?” While each question involves solving a different task, most state-of-the-art systems rely on monolithic approaches which use the same computation graph or network to compute the answer. However, these models offer limited interpretability and might not be effective for more complex reasoning tasks, such as answering the question: “How many things are the same size as the ball?” as shown in the figure below.

Representing questions in a modular structure allows compositional and interpretable reasoning.

To address this task, researchers from UC Berkeley proposed “Neural Module Networks” at CVPR 2016, which decompose the computation into explicit modules. In the example above, a module “finds” or locates the ball, then another module “relocates” or looks for objects of the same size, and finally the last module counts “how many” there are. Importantly, modules are reused across images and questions, e.g., the “find ball” module could also be used on another image to answer the question, “Are there more balls than cubes in the image?” As seen in the figure above, this also allows us to examine intermediate interpretable outputs in the form of “attention maps” which show where the model is looking.
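The find, relocate, and count composition described above can be illustrated with a small symbolic sketch. This is not the Neural Module Networks implementation: the real modules are learned neural networks operating on image features, while here a scene is a hand-written list of objects, an "attention map" is one weight per object, and the names `SceneObject`, `find`, `relocate_same_size`, and `count` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SceneObject:
    shape: str
    size: str


# Toy "attention map": one weight per object (an assumption; real
# attention maps are spatial and produced by learned modules).
Attention = List[float]


def find(scene: List[SceneObject], shape: str) -> Attention:
    """Attend to every object with the requested shape."""
    return [1.0 if obj.shape == shape else 0.0 for obj in scene]


def relocate_same_size(scene: List[SceneObject], attention: Attention) -> Attention:
    """Shift attention to the other objects that share a size with an attended one."""
    sizes = {obj.size for obj, w in zip(scene, attention) if w > 0.5}
    return [
        1.0 if w <= 0.5 and obj.size in sizes else 0.0
        for obj, w in zip(scene, attention)
    ]


def count(attention: Attention) -> int:
    """Count the attended objects."""
    return sum(1 for w in attention if w > 0.5)


scene = [
    SceneObject("ball", "small"),
    SceneObject("cube", "small"),
    SceneObject("cylinder", "large"),
    SceneObject("cube", "small"),
]

# "How many things are the same size as the ball?"
ball_attention = find(scene, "ball")              # inspectable intermediate output
same_size_attention = relocate_same_size(scene, ball_attention)
how_many = count(same_size_attention)

# The same `find` and `count` modules are reused for a different question:
# "Are there more balls than cubes in the image?"
more_balls_than_cubes = count(find(scene, "ball")) > count(find(scene, "cube"))
```

Each intermediate attention list can be printed and inspected, which is the toy counterpart of examining where the model is looking.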

While the original work relied on a non-differentiable natural language parser, two ICCV 2017 papers show how to train such a system end-to-end. They found end-to-end training to be essential for answering challenging compositional questions in the CLEVR dataset, which was presented at CVPR 2017 and publicly released by Facebook AI Research (FAIR) in collaboration with Stanford University.

The paper “Learning to Reason: End-to-End Module Networks for Visual Question Answering” first builds a policy or program from the question with an encoder-decoder recurrent neural network (RNN). It then builds a modular network which is executed on the image to answer the question.

However, the two papers propose different architectures. The first paper, Inferring and Executing Programs for Visual Reasoning, a collaboration between FAIR and Stanford University, uses different parameters for different modules but the same network structure. The second paper, Learning to Reason: End-to-End Module Networks for Visual Question Answering, a collaboration between UC Berkeley, Boston University, and FAIR, relies on different computations for different modules and shares parameters by embedding the language of the question.
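The predict-then-execute pipeline that both papers build on can be sketched as two separate stages: something maps a question to a program, and an executor assembles modules according to that program. The sketch below is only an illustration under strong assumptions: a lookup table stands in for the learned program generator, the program is a flat postfix token list run on a stack, and the token names (`find:ball`, `same_size`, `count`, `greater`) are hypothetical rather than taken from either paper.

```python
from typing import Dict, List, Union

Scene = List[Dict[str, str]]
Value = Union[List[bool], int, bool]

# Stand-in for the learned program generator (an assumption): the papers
# predict the program with a neural sequence model, not a lookup table.
PROGRAM_FOR_QUESTION: Dict[str, List[str]] = {
    "How many things are the same size as the ball?":
        ["find:ball", "same_size", "count"],
    "Are there more balls than cubes in the image?":
        ["find:ball", "count", "find:cube", "count", "greater"],
}


def predict_program(question: str) -> List[str]:
    return PROGRAM_FOR_QUESTION[question]


def execute(program: List[str], scene: Scene) -> Value:
    """Run a postfix program on the scene using a value stack."""
    stack: List[Value] = []
    for token in program:
        name, _, arg = token.partition(":")
        if name == "find":
            stack.append([obj["shape"] == arg for obj in scene])
        elif name == "same_size":
            mask = stack.pop()
            sizes = {obj["size"] for obj, m in zip(scene, mask) if m}
            stack.append(
                [(not m) and obj["size"] in sizes for obj, m in zip(scene, mask)]
            )
        elif name == "count":
            stack.append(sum(stack.pop()))
        elif name == "greater":
            right = stack.pop()
            left = stack.pop()
            stack.append(left > right)
        else:
            raise ValueError(f"unknown module: {token}")
    if len(stack) != 1:
        raise ValueError("program did not reduce to a single answer")
    return stack[0]


scene: Scene = [
    {"shape": "ball", "size": "small"},
    {"shape": "cube", "size": "small"},
    {"shape": "cylinder", "size": "large"},
    {"shape": "cube", "size": "small"},
]

count_answer = execute(
    predict_program("How many things are the same size as the ball?"), scene
)
compare_answer = execute(
    predict_program("Are there more balls than cubes in the image?"), scene
)
```

Keeping the two stages separate is what makes the predicted program itself something that can be supervised or inspected.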

Despite their different architectures, both works find that it is necessary to supervise the program prediction with ground truth programs to get good results, although a small number of training examples can be sufficient. The Inferring and Executing Programs paper shows that using reinforcement learning to allow the network to learn the best program end-to-end provides significant improvements over the ground truth programs and allows fine-tuning to novel questions and answers.

Recently, two network architectures, RelationNet and FiLM, have been proposed which reach the same or better performance with a monolithic network without relying on any ground truth programs during training. This also means that they lose their explicit and interpretable reasoning structure. The Inferring and Executing Programs paper also collected human questions instead of synthetically-generated questions for the CLEVR dataset. Here, none of these approaches generalizes well. Similarly, when evaluated on the Visual Question Answering (VQA) dataset with real images and questions, the Learning to Reason paper shows only limited performance improvement from program prediction, likely because the questions in the VQA dataset do not require reasoning as challenging as those in the CLEVR dataset.

Overall, we are excited to explore new ideas in the future to build models which are truly compositional and interpretable to handle the challenges of new configurations and programs in real-world scenarios.

Human-like visual dialog

Dhruv Batra, Devi Parikh, and their students at Georgia Tech and Carnegie Mellon University studied the problem of natural language dialog grounded in an image. They developed a novel two-person chat data-collection protocol to curate a large-scale Visual Dialog (VisDial) dataset containing one dialog with 10 question-answer pairs on each of 120,000 images, for a total of 1.2 million dialog question-answer pairs.

A demonstration of a visual dialog agent. The user uploads a picture. The agent begins by providing a caption for the picture “A large building with a clock tower in the middle”, and then answers a sequence of questions from the user.

Since this research is at the intersection of multiple fields, it is fostering new collaborations across disciplines to work together on common problems. To help further this line of research, they made the visual dialog dataset and code publicly available for dialog researchers to create custom datasets for their problems.

One perhaps counter-intuitive aspect of research into dialog is that it often treats dialog as a static supervised learning problem, rather than as an interactive agent learning problem. Essentially, at each round (t) of supervised training, the dialog model is artificially “injected” into the conversation between two humans and asked to answer a question. But the machine’s answer is thrown away because at the next round (t+1), the machine is again provided with the “ground-truth” human-human dialog that includes the human response and not the machine response. Thus, the machine is never allowed to steer the conversation because that would take the dialog out of the dataset, making it non-evaluable.
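The difference between the two settings can be made concrete with a toy loop. This is a minimal sketch, assuming a dialog is a list of (question, answer) pairs and a model is any function from (history, question) to an answer; `supervised_rounds`, `interactive_rounds`, and `toy_model` are hypothetical names, and no actual training is performed.

```python
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (question, answer)
Model = Callable[[List[Turn], str], str]


def supervised_rounds(human_dialog: List[Turn], model: Model):
    """At round t the model answers, but round t+1 sees the human answer."""
    history: List[Turn] = []
    predictions: List[str] = []
    for question, human_answer in human_dialog:
        # The model is "injected" into the human-human conversation.
        predictions.append(model(history, question))
        # Its answer is only scored; the history continues with the
        # ground-truth human response, so the model never steers the dialog.
        history.append((question, human_answer))
    return predictions, history


def interactive_rounds(questions: List[str], model: Model) -> List[Turn]:
    """The model's own answers become part of the conversation state."""
    history: List[Turn] = []
    for question in questions:
        history.append((question, model(history, question)))
    return history


def toy_model(history: List[Turn], question: str) -> str:
    # Placeholder policy (an assumption): always gives the same answer.
    return "i cannot tell"


human_dialog = [
    ("is it daytime?", "yes"),
    ("is the clock visible?", "yes, on the tower"),
    ("are there people?", "no"),
]

predictions, supervised_history = supervised_rounds(human_dialog, toy_model)
interactive_history = interactive_rounds([q for q, _ in human_dialog], toy_model)
```

In the supervised loop the final history is exactly the human dialog, whatever the model said; in the interactive loop it contains the model's own answers.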

To address this problem, researchers from Georgia Tech, Carnegie Mellon and FAIR introduce the first goal-driven training for visual question answering and visual dialog agents in their work, Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning (RL). They pose a cooperative “image guessing” game, GuessWhich, between two agents where a “questioner” Q-BOT and an “answerer” A-BOT communicate in natural language dialog. Before the game begins, A-BOT is assigned an image hidden from Q-BOT and both Q-BOT and A-BOT receive a natural language description of the image. At each subsequent round, Q-BOT generates a question, A-BOT responds with an answer, and then both update their states. At the end of 10 rounds, Q-BOT must guess the image by selecting it from a pool of images. We find that these RL-trained bots significantly outperform traditional supervised bots. Most interestingly, while the supervised Q-BOT attempts to mimic how humans ask questions, the RL-trained Q-BOT shifts strategies and asks questions that the A-BOT is better at answering, ultimately resulting in more informative dialog and a better team.
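The protocol of the game described above can be sketched as a simple loop. The sketch covers only the turn structure, not the reinforcement learning: images are hand-written attribute dictionaries, the caption is a partial attribute dictionary, questions are attribute names, and the classes `QBot` and `ABot` are hypothetical rule-based stand-ins for the learned agents.

```python
from typing import Dict, List

Image = Dict[str, str]
ATTRIBUTES = ["animal", "color", "place"]  # toy question vocabulary (assumption)


class ABot:
    """Answerer: sees the hidden image (stateless in this toy version)."""

    def __init__(self, image: Image) -> None:
        self.image = image

    def answer(self, question: str) -> str:
        return self.image[question]


class QBot:
    """Questioner: never sees the image, only the caption and the answers."""

    def __init__(self, caption: Image) -> None:
        self.belief: Image = dict(caption)

    def ask(self, round_index: int) -> str:
        unknown = [a for a in ATTRIBUTES if a not in self.belief]
        # Ask about something unknown; otherwise repeat an old question.
        return unknown[0] if unknown else ATTRIBUTES[round_index % len(ATTRIBUTES)]

    def update(self, question: str, answer: str) -> None:
        self.belief[question] = answer

    def guess(self, pool: List[Image]) -> int:
        def score(image: Image) -> int:
            return sum(image.get(k) == v for k, v in self.belief.items())

        return max(range(len(pool)), key=lambda i: score(pool[i]))


def play(pool: List[Image], target_index: int, caption: Image, rounds: int = 10) -> int:
    a_bot = ABot(pool[target_index])
    q_bot = QBot(caption)
    for round_index in range(rounds):
        question = q_bot.ask(round_index)
        answer = a_bot.answer(question)
        q_bot.update(question, answer)
    # After the last round, Q-BOT selects an image from the pool.
    return q_bot.guess(pool)


pool = [
    {"animal": "dog", "color": "brown", "place": "park"},
    {"animal": "dog", "color": "black", "place": "park"},
    {"animal": "dog", "color": "brown", "place": "beach"},
]
caption = {"animal": "dog"}

guess_after_dialog = play(pool, target_index=2, caption=caption)
guess_without_dialog = play(pool, target_index=2, caption=caption, rounds=0)
```

With only the caption, every image in this pool matches equally well and the guess falls back to the first one; the dialog is what lets the questioner single out the target.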

An alternative to goal-driven learning is to use an adversarial or perceptual loss, where a model has to discriminate between human and agent-generated responses. This idea has been pursued in an upcoming NIPS 2017 paper by researchers from FAIR and Georgia Tech, Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model. Also relevant is the paper Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training by researchers from the Max Planck Institute for Informatics, UC Berkeley, and FAIR. This paper shows that having to generate multiple descriptions at once for a single image, instead of just one at a time, allows the model to learn how to generate more diverse and human-like descriptions of images.

Open multi-disciplinary collaboration needed

For humans, a major part of brain function is devoted to visual processing, and natural language is how we communicate. Building AI agents that can connect vision and language is both exciting and very challenging. We discussed two research directions in this space: explicit visual reasoning and human-like visual dialog. While progress is being made, many challenges lie ahead. To move forward, it is critical to continue open, long-term, basic multi-disciplinary research collaborations between FAIR researchers, academia, and the entire AI ecosystem.


Dhruv Batra is a Research Scientist at Facebook AI Research (FAIR) and an Assistant Professor at Georgia Tech.

Devi Parikh is a Research Scientist at Facebook AI Research (FAIR) and an Assistant Professor at Georgia Tech.

Marcus Rohrbach is a Research Scientist at Facebook AI Research (FAIR).


References to this discussion. Several of these papers were recently presented at ICCV 2017.

VQA: Visual Question Answering Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh (ICCV 2015)

Neural Module Networks Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein (CVPR 2016)

Visual Dialog Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra (CVPR 2017)

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra (ICCV 2017)

Inferring and Executing Programs for Visual Reasoning Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick (ICCV 2017)

Learning to Reason: End-to-End Module Networks for Visual Question Answering Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko (ICCV 2017)

Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele (ICCV 2017)

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick (CVPR 2017)

Best of Both Worlds: Transferring Knowledge from Discriminative Learning to a Generative Visual Dialog Model Jiasen Lu, Anitha Kannan, Jianwei Yang, Devi Parikh, Dhruv Batra (NIPS 2017)

A simple neural network module for relational reasoning Adam Santoro, David Raposo, David G.T. Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, Timothy Lillicrap (NIPS 2017)

FiLM: Visual Reasoning with a General Conditioning Layer Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville