Today, Facebook AI Research (FAIR) open sourced DensePose, our real-time approach for mapping all human pixels of 2D RGB images to a 3D surface-based model of the body.
Recent research in human understanding aims primarily at localizing a sparse set of joints, such as the wrists and elbows. This may suffice for applications like gesture or action recognition, but it delivers a reduced image interpretation. We wanted to go further. Imagine trying on new clothes via a photo, or putting costumes on your friends' photos. For these tasks, a more complete, surface-based image interpretation is required.
The DensePose project addresses this and aims at understanding humans in images in terms of such surface-based models. Our work shows that one can efficiently compute dense correspondences between 2D RGB images and 3D surface models of the human body. Unlike common works in human pose estimation that operate with 10 or 20 human joints (wrists, elbows, etc.), this work accounts for the entirety of the human body, defined in terms of more than 5,000 nodes. The speed and accuracy of the resulting system bring it closer to applications in augmented and virtual reality.
Earlier works on this problem required computation on the order of minutes, needed initialization by an external system (e.g., for human joint localization), and were particularly brittle. DensePose operates at multiple frames per second on a single GPU and can handle tens or even hundreds of humans simultaneously.
To do this, we introduced DensePose-COCO, a large-scale ground-truth dataset with image-to-surface correspondences manually annotated on 50K persons from the COCO dataset. The ground truth comes in the form of image-to-surface correspondences at a randomly sampled set of positions on each human body, along with segmented body parts. We follow the exact same train/val/test split as the COCO challenge.
DensePose-COCO annotations: given an RGB image we associate multiple pixels of every person with UV coordinates.
DensePose-COCO annotations: we associate multiple pixels of every person with positions on a 3D surface.
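To make the annotation format concrete, here is a minimal sketch of what a DensePose-COCO-style record might look like: each annotated pixel on a person carries a body-part index and continuous UV surface coordinates. The field names below are illustrative assumptions, not the released schema.

```python
# Hypothetical annotation record for one person in one image.
# Field names are assumptions for illustration, not the actual schema.
annotation = {
    "image_id": 42,
    # sampled points on the person, in pixel coordinates
    "x": [120.5, 133.0, 150.2],
    "y": [88.0, 95.5, 101.0],
    # body-part index for each sampled point
    "part_id": [2, 2, 15],
    # continuous surface (UV) coordinates within each part, in [0, 1]
    "u": [0.31, 0.44, 0.72],
    "v": [0.58, 0.61, 0.12],
}

def points_to_surface(ann):
    """Pair each annotated pixel with its (part, u, v) surface coordinate."""
    return [
        ((x, y), (p, u, v))
        for x, y, p, u, v in zip(ann["x"], ann["y"],
                                 ann["part_id"], ann["u"], ann["v"])
    ]
```

Each pair maps a 2D image location to a point on the 3D body surface, which is exactly the correspondence the annotators provide.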
We also developed a novel deep network architecture for our task. We build on FAIR's Detectron system and extend it to incorporate dense pose estimation capabilities. As in Detectron's Mask-RCNN system, we use Region-of-Interest Pooling followed by fully-convolutional processing. We augment the network with three output channels, trained to deliver an assignment of pixels to body parts and their U-V surface coordinates. The resulting architecture is effectively as fast as Mask-RCNN, thanks to Caffe2.
DensePose-RCNN architecture: we use a cascade of region proposal generation and feature pooling, followed by a fully-convolutional network that densely predicts discrete part labels and continuous surface coordinates.
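The idea of predicting discrete part labels together with continuous surface coordinates can be sketched as a simple post-processing step: classify each pixel into a part, then read the UV coordinate from that part's regression channel. This is an illustrative NumPy sketch under assumed tensor shapes, not the Detectron implementation.

```python
import numpy as np

def decode_densepose_outputs(part_logits, u_coords, v_coords):
    """Turn per-pixel head outputs into a (part, u, v) map.

    part_logits: (C, H, W) scores over C body-part classes (0 = background).
    u_coords, v_coords: (C, H, W) per-part UV regressions in [0, 1].
    Shapes and the decoding scheme are assumptions for illustration.
    """
    parts = part_logits.argmax(axis=0)        # (H, W) hard part labels
    h_idx, w_idx = np.indices(parts.shape)    # pixel coordinate grids
    u = u_coords[parts, h_idx, w_idx]         # gather UV from the chosen part
    v = v_coords[parts, h_idx, w_idx]
    return parts, u, v
```

Selecting the UV regression from the winning part channel, rather than averaging across parts, keeps the surface coordinate consistent with the discrete part assignment.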
Our goal in open sourcing DensePose is to make our research accessible and as open as possible. We hope that DensePose brings researchers and developers across Computer Vision, Augmented Reality and Computer Graphics closer together and will soon give rise to new experiences — whether it’s creating whole-body filters or learning new dance moves from your cell phone.
DensePose is available under the Creative Commons license on GitHub. We're also releasing performance baselines for multiple pre-trained models, along with the ground-truth information for DensePose-COCO.