A Practical Stereo Depth System for Smart Glasses
We present the design of a productionized end-to-end stereo depth sensing system that does pre-processing, online stereo rectification, and stereo depth estimation with...
In this paper, we present AGRoL, a novel conditional diffusion model designed specifically to track full bodies given sparse upper-body tracking signals. Our model uses a simple...
To address these issues, we propose a novel framework Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT), in which a...
In this work, we propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audiovisual speech via a transformer-based architecture, and then converts...
We reduce the quantization loss of a given image representation by making imperceptible changes to the image before its release. The loss is back-propagated through the deep...
Hence, in this work, we investigate the use of automatically-generated transcriptions of unlabelled datasets to increase the training set size. For this purpose, we use...
This paper introduces a new large consent-driven dataset aimed at assisting in the evaluation of algorithmic bias and robustness of computer vision and audio speech models in...
In this blog, we’ll introduce the EgoObjects pilot version, consisting of 9,273 videos (30+ hours) for 368 categories of objects, with a total of 654K annotations.
In this work, we make a first attempt to repair DNNs by jointly optimizing the architecture and weights at a higher (i.e., block) level. We first perform empirical...
Meta AI is sharing new research to reduce the latency of existing Vision Transformer (ViT) models without the need for additional training. Our approach, called Token Merging...