Artificial Intelligence (AI) is central to today’s Facebook experiences. It is woven intricately into the platform such that nearly everything people see and do is informed by AI and machine learning. We have trained computer vision models to accelerate our work with AI to help people communicate, and we’re just getting started on this journey.
With over 2 billion active users on Facebook each month, and substantial growth in Instagram and Messenger, it’s vital that we keep advancing our level of content understanding. We do this by applying novel computer vision technologies, with the ultimate goal of generalizing AI. At Facebook, vision research topics span everything from computational photography, visual dialog, content and image understanding, and virtual reality to satellite imagery.
This week Facebook and Oculus researchers will share their latest computer vision work at the IEEE Computer Vision and Pattern Recognition (CVPR) Conference, in Honolulu, Hawaii. CVPR is the premier annual computer vision conference that brings together a community of academic and industry scholars.
Researchers will present findings across high-level topic areas: Object Recognition, Content Understanding, Large Scale Computer Vision for Remote Sensing Imagery, and Machine Learning.
Training and deploying AI models is often associated with massive data centers or supercomputers, with good reason. The ability to continually process, create, and improve models from all kinds of information (images, video, text, and voice) at massive scale is no small feat. Deploying these models on mobile devices so they’re fast and lightweight can be equally daunting. Overcoming these challenges requires a robust, flexible, and portable deep learning framework. Recognizing that this is a problem that many researchers and innovators face, we’ve made the Caffe2 framework available to the community.
Caffe2 is a lightweight and modular deep learning framework emphasizing portability while maintaining scalability and performance.
At CVPR we will be holding a meetup on Monday, July 24th, from 5-6pm to share and receive feedback about Caffe2. Researchers are also encouraged to apply for Caffe2 research awards to conduct research utilizing the framework, with an emphasis on understanding intelligence and building intelligent systems for the research community. “We’re committed to providing the community with high-performance machine learning tools so that everyone can create intelligent apps and services,” said Facebook research scientist Yangqing Jia.
Laurens van der Maaten and colleagues received a CVPR Best Paper Award for their paper, Densely Connected Convolutional Networks. The paper embraces the observation that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. It introduces the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.
The paper describes several compelling advantages of DenseNets such as alleviating the vanishing-gradient problem, strengthening feature propagation, encouraging feature reuse, and substantially reducing the number of parameters. Code and pre-trained models are available at https://github.com/liuzhuang13/DenseNet for the broader community to explore.
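The dense connectivity pattern described above can be illustrated with a small sketch. This is not the authors' implementation: `toy_layer` is a hypothetical stand-in for a real convolutional layer, feature maps are reduced to plain lists with one number per channel, and the input values are made up. The name `growth_rate` follows the paper's term for the number of new channels each layer contributes. The only point is to show that each layer receives the concatenation of the block input and all earlier layer outputs.

```python
from typing import List, Tuple

Feature = List[float]  # toy "feature map": one number per channel


def toy_layer(inputs: Feature, growth_rate: int) -> Feature:
    """Hypothetical stand-in for a real conv layer: emits `growth_rate` new channels."""
    mean = sum(inputs) / len(inputs)
    # ReLU-like clamp so the toy output stays non-negative
    return [max(0.0, mean + 0.1 * i) for i in range(growth_rate)]


def dense_block(x: Feature, num_layers: int, growth_rate: int) -> Tuple[Feature, List[int]]:
    """Run a toy dense block and record how many channels each layer receives."""
    features = [x]      # block input, then the output of each layer so far
    input_widths = []   # number of input channels seen by each layer
    for _ in range(num_layers):
        # Dense connectivity: concatenate everything produced so far
        concatenated = [c for f in features for c in f]
        input_widths.append(len(concatenated))
        features.append(toy_layer(concatenated, growth_rate))
    output = [c for f in features for c in f]
    return output, input_widths


out, widths = dense_block([1.0, 2.0, 3.0, 4.0], num_layers=3, growth_rate=2)
print(widths)    # [4, 6, 8]
print(len(out))  # 10
```

Because every layer adds only `growth_rate` channels while reusing all earlier ones, the input width grows linearly with depth. This is one way to read the feature reuse and parameter savings mentioned above.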
A joint contribution from Facebook researcher Ilke Demir and Ramesh Raskar of Facebook and the MIT Media Lab, along with other colleagues, titled Robocodes: Towards Generative Street Addresses from Satellite Imagery, received a workshop best paper award. This work is being presented during the EarthVision Workshop, which focuses on large scale computer vision for remote sensing imagery.
Maps are gaining importance as geospatial content around the world grows; notably, 70% of the world remains unmapped, and there is no generative addressing scheme to automatically map unknown areas. Current solutions rely on extracting road geometry, but the novel approach explored in the paper takes on the semantic dimension of mapping, generating street addresses for every 5m x 5m area.
The computer vision and remote sensing communities have recently started to focus on learning important features from satellite imagery. Much of what was previously considered largely theoretical research can now be used for real-world impact.
“Our world and the technology to understand it are progressing from one dimensional data, such as signal, text, and speech, towards more dimensions such as images, videos, and 3D spaces. Computer vision techniques are bridging this gap between our multi-dimensional world and a person’s Facebook world. Our work aims to bring the world to the user,” said Ilke Demir, Facebook PostDoc Researcher.
“Winning the best paper award helps us believe even more strongly that our system can be implemented to uniquely locate the rest of the world, especially in disaster zones and unmapped areas that lack urban infrastructure.”
Teaching computers to understand what’s in an image, and intersecting that with chatbot dialog work, requires a new level of content understanding. Researchers will be presenting a number of papers on these topics at CVPR. In addition, Georgia Tech, Virginia Tech, and Facebook are hosting a Visual Question Answering Workshop on Wednesday, July 26th, to bring together a community of experts interested in Visual Question Answering to share state-of-the-art approaches, best practices, and future directions in multi-modal AI.
The purpose of the workshop is to deliver the 2nd edition of the Visual Question Answering Challenge on the 2nd edition (v2.0) of the VQA dataset introduced in Goyal et al., CVPR 2017. It will provide an opportunity to benchmark algorithms on VQA v2.0 and to identify state-of-the-art algorithms that need to truly understand the image content in order to perform well on this balanced VQA dataset.
If you are at CVPR, be sure to check out all the work being presented by Facebook researchers and engineers, and stop by the exhibition hall to speak with them outside the sessions.
Aggregated Residual Transformations for Deep Neural Networks
Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, Kaiming He
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick
Densely Connected Convolutional Networks
Gao Huang, Zhuang Liu, Laurens van der Maaten, Kilian Q. Weinberger
Discovering Causal Signals in Images
David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, Leon Bottou
Feature Pyramid Networks for Object Detection
Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, Serge Belongie
Hard Mixtures of Experts for Large Scale Weakly Supervised Vision
Sam Gross, Marc’Aurelio Ranzato, Arthur Szlam
Learning Features by Watching Objects Move
Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan
Link the Head to the “Beak”: Zero Shot Learning From Noisy Text Description at Part Precision
Mohamed Elhoseiny, Yizhe Zhu, Han Zhang, Ahmed Elgammal
Relationship Proposal Networks
Ji Zhang, Mohamed Elhoseiny, Scott Cohen, Walter Chang, Ahmed Elgammal
Robocodes: Towards Generative Street Addresses from Satellite Imagery
Ilke Demir, Aman Raj, Forest Hughes, Kleovoulos Tsourides, Suryanarayana Murthy, Kaunil Dhruv, Sanyam Garg, Jatin Malhotra, Barrett Doo, Divyaa Ravichandran, and Ramesh Raskar
Semantic Amodal Segmentation
Yan Zhu, Yuandong Tian, Dimitris Metaxas, Piotr Dollár
Visual Dialog
Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav, José M. F. Moura, Devi Parikh, Dhruv Batra