October 20, 2017

Facebook at ICCV 2017

By: Meta Research

Computer vision experts from around the world will gather in Venice this week at the International Conference on Computer Vision (ICCV) to present the latest advances in computer vision and related areas. Research from Facebook will be presented in 15 peer-reviewed publications and posters. Facebook researchers will also lead and present numerous workshops and tutorials.

Here is a complete list of the Facebook research being presented at ICCV, organized by research topic.

Semantic and image segmentation

Mask R-CNN – Marr Prize, Best Paper Award Winner
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick

This paper develops a new system that, for each pixel in a photo, can predict both what kind of object that pixel belongs to and which particular instance of that object it belongs to. So the system will not only outline sheep and tell you that they are sheep (“semantic” segmentation), but it will also tell you which parts of the image correspond to which sheep (“instance” segmentation). Mask R-CNN is one of the first systems to successfully do this. Mike Schroepfer, Facebook’s CTO, showed several demonstrations of Mask R-CNN in his keynote at F8 earlier this year.
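
For readers who want to try this themselves, here is a minimal sketch of running instance segmentation with a pretrained Mask R-CNN. It uses torchvision's later reimplementation of the model rather than the paper's original code, and the image path and 0.5 score threshold are placeholder choices:

```python
# Minimal sketch: instance segmentation with a pretrained Mask R-CNN
# (torchvision's reimplementation, not the paper's original release).
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = to_tensor(Image.open("sheep.jpg").convert("RGB"))  # placeholder image
with torch.no_grad():
    (output,) = model([image])  # one result dict per input image

keep = output["scores"] > 0.5    # arbitrary confidence threshold
boxes = output["boxes"][keep]    # one bounding box per detected instance
labels = output["labels"][keep]  # COCO category id for each instance
masks = output["masks"][keep]    # one soft per-pixel mask per instance
print(f"found {len(boxes)} object instances")
```

The per-instance masks are what distinguish this from plain semantic segmentation: two sheep yield two separate masks, not a single “sheep” region.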

Predicting Deeper into the Future of Semantic Segmentation
Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, Yann LeCun

The paper develops a deep learning model that, from a particular frame (still) in a video, tries to predict what the next frames will look like. So in a sense, it is trying to guess what the future of the video will look like. The paper shows that the resulting model can be used to improve the quality of computer-vision systems in tasks such as semantic segmentation.

Segmentation-Aware Convolutional Networks Using Local Attention Masks
Adam W. Harley, Konstantinos G. Derpanis, Iasonas Kokkinos

The neurons within a convolutional network look at increasingly large parts of the image as one goes deeper in the network. This can give poorly localized, blurred responses, because each neuron looks at a very large part of the image. In this work, we sharpen such responses by making every neuron “attend” only to the region of interest.
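
A simplified sketch of that mechanism: compute a per-pixel embedding, and when convolving, down-weight neighbors whose embeddings differ from the center pixel's, so responses stop blurring across object boundaries. This is an illustrative rendering under simplifying assumptions (the exp-of-L1-distance mask and all shapes are illustrative), not the paper's implementation:

```python
# Illustrative sketch of a segmentation-aware convolution: each neighbor's
# contribution is gated by how similar its embedding is to the center
# pixel's embedding. Simplified; not the paper's implementation.
import torch
import torch.nn.functional as F

def seg_aware_conv(x, emb, weight, k=3):
    # x: (N, C, H, W) features; emb: (N, E, H, W) per-pixel embeddings;
    # weight: (C_out, C, k, k) ordinary convolution kernel
    N, C, H, W = x.shape
    E = emb.shape[1]
    pad = k // 2
    cols = F.unfold(x, k, padding=pad).view(N, C, k * k, H * W)
    ecols = F.unfold(emb, k, padding=pad).view(N, E, k * k, H * W)
    center = emb.view(N, E, 1, H * W)
    # Attention mask: ~1 for identical embeddings, ~0 across boundaries
    mask = torch.exp(-(ecols - center).abs().sum(dim=1))   # (N, k*k, H*W)
    mask = mask / mask.sum(dim=1, keepdim=True).clamp(min=1e-8)
    cols = cols * mask.unsqueeze(1)                        # gate neighbors
    out = weight.view(1, -1, C * k * k) @ cols.reshape(N, C * k * k, H * W)
    return out.view(N, -1, H, W)
```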

Dense and Low-Rank Gaussian CRFs Using Deep Embeddings
Siddhartha Chandra, Nicolas Usunier, Iasonas Kokkinos

Although convolutional networks can classify the pixels in an image into different categories (car, airplane, …) very accurately, neighboring decisions are often not coherent: half of an object may be labeled “bed” and the other half “sofa”. This paper proposes a technique that couples the classifications of all pixels to produce coherent predictions in a very efficient manner.

Object detection

Focal Loss for Dense Object Detection – Best Student Paper Award
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

This paper presents a new system for object detection that is technically different from the current state of the art. Whereas most other systems consist of multiple “stages”, each of which is implemented by a different model, this paper develops a model that solves the entire object-detection problem in a single stage. This simplicity is appealing because it makes the system much easier to implement and use.
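
The ingredient that makes a one-stage detector competitive here is the focal loss itself: a rescaled cross-entropy that down-weights easy, already-well-classified examples, so training is not overwhelmed by the huge number of easy background regions a dense detector must score. A minimal PyTorch rendering, with the paper's default settings of gamma = 2 and alpha = 0.25:

```python
# Focal loss: cross-entropy scaled by (1 - p_t)^gamma, so confident,
# easy examples contribute almost nothing and hard ones dominate.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # logits, targets: per-anchor binary labels for dense detection
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```

(In the paper, the summed loss is further normalized by the number of anchors assigned to ground-truth objects.)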

Low-shot Visual Recognition by Shrinking and Hallucinating Features
Bharath Hariharan, Ross Girshick

Object detection systems are generally trained on thousands of example images of each of the objects they need to recognize. This paper focuses on the problem of recognizing a new object class after only seeing very few examples of that class. It does so by “hallucinating” additional examples of the object we want to learn to recognize.

Transitive Invariance for Self-supervised Visual Representation Learning
Xiaolong Wang, Kaiming He, Abhinav Gupta

The paper proposes to learn better models for object detection by observing how the appearance of objects changes in a video. For instance, a video of a car driving by shows the car from different sides in different frames. Because you know each of the frames depicts the same car, you can use this information to learn models that better understand different views of the same object. The resulting models can then be used to improve object detectors.

Image classification

Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization
Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, Dhruv Batra

Most modern image classification systems are based on a model called convolutional networks. These networks work very well, but they are also very much a “black box”. This paper develops a new technique that lets you “open the box” by visualizing which regions in the photo led the system to classify it in a particular way.
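
The recipe is compact enough to sketch: take the gradient of the class score with respect to the last convolutional feature maps, average it over each map to get per-map importance weights, and keep the positively weighted evidence as a heatmap. In the sketch below, `features` (the last convolutional activations, kept in the autograd graph) and `score` (the logit of the class of interest) are assumed to come from a forward pass through your own network:

```python
# Sketch of the Grad-CAM computation for one image and one class.
import torch
import torch.nn.functional as F

def grad_cam(features, score):
    # features: (1, C, H, W) conv activations; score: scalar class logit
    (grads,) = torch.autograd.grad(score, features)
    weights = grads.mean(dim=(2, 3), keepdim=True)  # per-map importance
    cam = F.relu((weights * features).sum(dim=1))   # (1, H, W) heatmap
    return cam / cam.max().clamp(min=1e-8)          # normalize to [0, 1]
```

Upsampled to the input resolution and overlaid on the photo, the heatmap highlights the evidence for the chosen class.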

Learning Visual N-Grams from Web Data
Ang Li, Allan Jabri, Armand Joulin, Laurens van der Maaten

Most image-recognition systems are trained on large collections of images that are manually annotated. This annotation process is cumbersome and does not scale. This paper develops an image-recognition system that is trained on 50 million photos and user comments without manual annotations. The system can recognize objects, landmarks, and scenes that span multiple words, such as “Golden Gate Bridge” or “Statue of Liberty”.

Combining vision and language

Inferring and Executing Programs for Visual Reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick

This paper considers visual reasoning problems: given an image, it aims to answer questions such as “what is the shape of the thing in front of the blue box?”. It does so by using a “module network” that converts the question into a simple computer program, and implements each instruction in that program using a neural network. The paper also presents a new dataset for visual reasoning, called CLEVR-Humans.
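
As a toy illustration of what executing such a program means, the sketch below replaces the paper's neural modules (which operate on image features) with plain functions over a hand-coded symbolic scene; the scene, module names, and program are all made up for illustration:

```python
# Toy, symbolic stand-in for the paper's neural module execution.
scene = [
    {"shape": "cube", "color": "blue", "pos": (0, 0)},
    {"shape": "sphere", "color": "red", "pos": (0, -1)},
]

modules = {
    "filter_color[blue]": lambda objs: [o for o in objs if o["color"] == "blue"],
    "relate[front]": lambda objs: [o for o in scene
                                   if any(o["pos"][1] < r["pos"][1] for r in objs)],
    "query_shape": lambda objs: objs[0]["shape"],
}

# "What is the shape of the thing in front of the blue box?"
program = ["filter_color[blue]", "relate[front]", "query_shape"]
state = scene
for instruction in program:
    state = modules[instruction](state)
print(state)  # -> sphere
```

In the paper, the program itself is predicted from the question by a learned sequence-to-sequence model, and each module is a small neural network.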

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning
Abhishek Das, Satwik Kottur, José M. F. Moura, Stefan Lee, Dhruv Batra

This paper develops a chatbot for answering questions about images. You can ask this chatbot things like “What is the color of the woman’s umbrella?”. If there are two women in the image, the chatbot will ask: “Which woman?”. You respond: “The one with the dark hair.”, and the chatbot will tell you: “The umbrella is blue”. We are still very far from really solving this problem, but this is one of the first papers that tries to address it.

Learning to Reason: End-to-End Module Networks for Visual Question Answering
Ronghang Hu, Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Kate Saenko

This paper describes a new technique for answering questions such as “what is the color of the ball to the left of the purple cylinder?”. The technique does so by converting the question into a small computer program; each instruction in the program is then executed by a neural network. Both the program “generator” and the program “executor” are learned from pairs of images and questions.

Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training
Rakshith Shetty, Marcus Rohrbach, Lisa Anne Hendricks, Mario Fritz, Bernt Schiele

This paper deals with the problem of automatically generating a caption, that is, a natural-language description of an image. The main technical innovation is adversarial training that pushes the captions generated by the system to look more like captions that would have been written by humans.

Image generation

Unsupervised Creation of Parameterized Avatars
Lior Wolf, Yaniv Taigman, Adam Polyak

This paper develops a new system that, based on a regular photo of your face, generates an avatar that looks just like you. The main technical innovation is that the system is trained in an “unsupervised” way. This means that it was not trained on pairs of face images and corresponding avatars: all it has seen are a bunch of faces and a bunch of avatars. The system learns to figure out automatically which avatars correspond to which faces.

3D vision

Deltille Grids for Geometric Camera Calibration
Hyowon Ha, Michal Perdoch, Hatem Alismail, In So Kweon, Yaser Sheikh

Three-dimensional models of objects are used, for instance, in virtual reality. These models are made by photographing the object in a “dome” that contains hundreds of cameras, all taking a picture at the same time. These cameras need to be calibrated, so that the system that combines all the images into a 3D model of the object knows exactly where the cameras are located. For decades, this calibration has been done by photographing a standard checkerboard. This paper shows that by using a board with triangular tiles instead of square ones, cameras can be calibrated more accurately.
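
For context, here is what the classical square-checkerboard calibration pipeline looks like with OpenCV; the paper's deltille grids replace the square pattern with a triangular one to locate corners more precisely. The board size and file paths below are placeholders:

```python
# Classical checkerboard calibration with OpenCV (the baseline the
# paper improves on). Board size and image paths are placeholders.
import glob
import cv2
import numpy as np

pattern = (9, 6)  # inner corners per row and column
# 3D corner positions on the planar board, in board coordinates
board = np.zeros((pattern[0] * pattern[1], 3), np.float32)
board[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

object_points, image_points = [], []
for path in glob.glob("calib_images/*.jpg"):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        object_points.append(board)
        image_points.append(corners)

# Recover the camera intrinsics and per-image poses
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    object_points, image_points, gray.shape[::-1], None, None)
print("RMS reprojection error:", rms)
```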

Other Facebook research activities at ICCV 2017

Instance-Level Visual Recognition Tutorial
Talks by Georgia Gkioxari, Kaiming He, and Ross Girshick

Closing the Loop between Vision and Language Workshop
Larry Zitnick, Opening keynote
Dhruv Batra, Invited talk

Generative Adversarial Networks tutorial
Soumith Chintala presents his GANs-in-the-wild paper

Role of Simulation in Computer Vision workshop
Devi Parikh, Invited talk

Workshop on Web-Scale Vision and Social Media
Ang Li, Invited talk on his Facebook internship project

Workshop on Computer Vision for Virtual Reality
Organized by Frank Dellaert and Richard Newcombe

COCO + Places Workshop
Team FAIR presents its competition submission

PoseTrack Challenge Workshop
Yaser Sheikh, Invited talk
Georgia Gkioxari, Rohit Girdhar, Du Tran, Lorenzo Torresani, and Deva Ramanan present their challenge submission