The Facebook AI Camera Team is working on various computer vision technologies and creative tools to help people express themselves. For example, with real-time “style transfer,” you can give your photos or videos the look of a Van Gogh painting. With the real-time face tracker, you can add makeup or even replace your face with an avatar. So what if you could replace your entire body with an avatar?
Replacing the entire body with an avatar requires accurately detecting and tracking body movements in real time. This is a very challenging problem due to large variations in pose and identity. A person might be sitting, walking, or running. He or she might be wearing a long coat or shorts. And a person is often occluded by other people or objects. All of these factors dramatically increase the difficulty of building a robust body tracking system.
We recently developed a new technology that can accurately detect body poses and segment a person from the background. Our model is still in the research phase, but it is only a few megabytes in size and can run on smartphones in real time. Someday, it could enable many new applications, such as creating body masks, using gestures to control games, or de-identifying people.
Our human detection and segmentation model is based on the Mask R-CNN framework: a conceptually simple, flexible, and general framework for object detection and segmentation. It can efficiently detect objects in an image while simultaneously predicting keypoints and generating a segmentation mask for each object. Mask R-CNN won the best paper award at ICCV 2017. To run Mask R-CNN models in real time on mobile devices, researchers and engineers from the Camera, FAIR, and AML teams worked together to build an efficient, lightweight framework: Mask R-CNN2Go.
The Mask R-CNN2Go model consists of five major components, sketched below.
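As a rough illustration, here is a minimal Python sketch of how five such components fit together at inference time, assuming the standard Mask R-CNN decomposition (trunk/backbone, region proposal network, detection head, keypoint head, and segmentation head). All names are illustrative, not the actual Mask R-CNN2Go API.

```python
from typing import Callable, NamedTuple

# A minimal sketch of a Mask R-CNN-style pipeline. The decomposition
# follows the public Mask R-CNN paper; names are hypothetical.

class MaskRCNNStages(NamedTuple):
    trunk: Callable              # image -> deep feature map
    rpn: Callable                # features -> candidate boxes (proposals)
    detection_head: Callable     # (features, proposals) -> refined boxes
    keypoint_head: Callable      # (features, boxes) -> keypoints per person
    segmentation_head: Callable  # (features, boxes) -> mask per person

def run_inference(stages: MaskRCNNStages, image):
    features = stages.trunk(image)                      # 1. extract features
    proposals = stages.rpn(features)                    # 2. propose regions
    boxes = stages.detection_head(features, proposals)  # 3. detect people
    keypoints = stages.keypoint_head(features, boxes)   # 4. predict poses
    masks = stages.segmentation_head(features, boxes)   # 5. segment people
    return boxes, keypoints, masks
```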
Unlike modern GPU servers, mobile phones have limited computational power and storage. The original Mask R-CNN model is based on ResNet, which is too big and too slow to run on mobile phones. To solve this problem, we developed a very efficient model architecture optimized for mobile devices.
We applied several approaches to reduce the model size. We optimized the number of convolution layers and the width of each layer, since convolution is the most time-consuming part of processing. To ensure a large enough receptive field, we use a combination of kernel sizes, including 1×1, 3×3, and 5×5. We also apply weight pruning to further reduce the size. Our final model is only a few megabytes and is very accurate.
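To make the kernel-size mix and the pruning idea concrete, here is a minimal PyTorch sketch of a block that runs 1×1, 3×3, and 5×5 convolutions in parallel and concatenates the results, plus a simple magnitude-based pruning helper. The channel splits and the pruning scheme are illustrative assumptions; the production model is a Caffe2 network with its own tuned layer widths.

```python
import torch
import torch.nn as nn

class MixedKernelBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions, concatenated channel-wise.

    Channel counts are illustrative, not the production values.
    """

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        branch = out_ch // 3  # assumed even split across the three branches
        # Padding keeps every branch at the same spatial resolution.
        self.conv1 = nn.Conv2d(in_ch, branch, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, branch, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, out_ch - 2 * branch, kernel_size=5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.cat([self.conv1(x), self.conv3(x), self.conv5(x)], dim=1)
        return self.relu(out)

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude weights: one simple form of weight pruning."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold).to(weight.dtype)
```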
To run deep learning algorithms in real time, we leverage and optimize our core framework, Caffe2, together with mobile CPU and GPU libraries including NNPACK, SNPE, and Metal. By dispatching computation to these libraries, we significantly improve mobile computation speed. All of this is done with a modular design, without changing the general model definition, so we get both a small model size and a fast runtime while avoiding potential incompatibilities.
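In Caffe2, this kind of modularity can be expressed through an operator's engine field: the same operator definition can be dispatched to a different backend without touching the network structure. The snippet below is only an illustration with made-up blob names, not the production setup.

```python
from caffe2.python import core

# Illustrative only: blob names are invented. In Caffe2, the `engine`
# field selects the compute backend for an operator, so the same model
# definition can run on NNPACK (mobile CPU) or another implementation
# without changing the network graph.
conv_op = core.CreateOperator(
    "Conv",
    ["data", "conv1_w", "conv1_b"],
    ["conv1_out"],
    kernel=3,
    pad=1,
    stride=1,
    engine="NNPACK",  # swap the backend here; the graph stays the same
)
```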
Facebook AI Research (FAIR) recently published Detectron, its Mask R-CNN research platform. We have open-sourced our implementations of the Caffe2 operators (GenerateProposalsOp, BBoxTransformOp, BoxWithNMSLimit, and RoIAlignOp), along with the model conversion code needed for inference, for the community to use.
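As a sketch of how these operators chain together at inference time, the snippet below wires RPN outputs through proposal generation, RoI feature pooling, box refinement, and NMS. Blob names and argument values are assumptions for illustration, not the production configuration.

```python
from caffe2.python import core

# Illustrative wiring of the open-sourced operators; all blob names
# and argument values below are assumed for the sketch.

# RPN outputs (objectness scores, box deltas) -> candidate RoIs.
gen_proposals = core.CreateOperator(
    "GenerateProposals",
    ["rpn_cls_probs", "rpn_bbox_deltas", "im_info", "anchors"],
    ["rpn_rois", "rpn_roi_probs"],
    spatial_scale=1.0 / 16.0,  # assumed feature stride of 16
    pre_nms_topN=1000,
    post_nms_topN=100,
    nms_thresh=0.7,
    min_size=16,
)

# Pool a fixed-size feature map for each RoI.
roi_align = core.CreateOperator(
    "RoIAlign",
    ["features", "rpn_rois"],
    ["roi_feat"],
    spatial_scale=1.0 / 16.0,
    pooled_h=7,
    pooled_w=7,
    sampling_ratio=2,
)

# Apply the predicted box deltas, then run class-aware NMS.
bbox_transform = core.CreateOperator(
    "BBoxTransform",
    ["rpn_rois", "bbox_deltas", "im_info"],
    ["bbox_pred"],
)
box_nms = core.CreateOperator(
    "BoxWithNMSLimit",
    ["cls_probs", "bbox_pred"],
    ["score_nms", "bbox_nms", "class_nms"],
    score_thresh=0.05,
    nms=0.5,
    detections_per_im=10,
)
```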
Developing computer vision models for mobile devices is a challenging task. A mobile model has to be small, fast, and accurate without large memory requirements. We will continue exploring new model architectures that lead to more efficient models. We will also explore models that better fit mobile GPUs and DSPs, which have the potential to save both battery and computational power.