BANMo: Building Animatable 3D Neural Models from Many Casual Videos

Conference on Computer Vision and Pattern Recognition (CVPR)


Prior work for articulated 3D shape reconstruction often relies on specialized multi-view and depth sensors or pre-built deformable 3D models. Such methods do not scale to diverse sets of objects in the wild. We present a method that requires neither of them. It aims to create high-fidelity, articulated 3D models from many casual RGB videos in a differentiable rendering framework. Our key insight is to merge three schools of thought: (1) classic deformable shape models that make use of articulated bones and blend skinning, (2) canonical embeddings that establish correspondences between pixels and a canonical 3D model, and (3) volumetric neural radiance fields (NeRFs) that are amenable to gradient-based optimization. We introduce neural blend skinning models that allow for differentiable and invertible articulated deformations. When combined with canonical embeddings, such models allow us to establish dense correspondences across videos that can be self-supervised with cycle consistency. On real and synthetic datasets, our method shows higher-fidelity 3D reconstructions than prior works for humans and animals, with the ability to render realistic images from novel viewpoints.

Visit our project page here.

Latest Publications