Object detection and recognition is a fundamental research topic in the computer vision community. Thanks to major datasets and benchmark efforts [3, 4, 6, 7, 11], today's object understanding systems have demonstrated impressive capabilities in detecting and recognizing objects under various levels of supervision and data distributions. Though the progress is exciting, existing video datasets [12, 13, 14, 15] are often exocentric, using media captured from third-person viewpoints. As a result, they do not properly represent the data distributions seen from egocentric viewpoints, a domain critical to the success of many metaverse initiatives here at Meta.
Efforts have been made to build egocentric datasets [1, 2, 5, 8], which provide a first-person viewpoint into the real world. From the object understanding perspective, however, these datasets still leave several gaps. First, some datasets, such as Ego4D [5], are not object-centric: objects are annotated in only a subset of the dataset, and each annotated object may appear and disappear quickly in the video. These datasets lack enough samples to represent the diversity of conditions under which objects appear in egocentric videos, such as background, viewing distance, angle, lighting, and capture device. Thus, they are less suitable for training object understanding models whose predictions are robust to the different scenarios found in egocentric data. Second, some datasets contain object annotations only from a limited set of semantic categories or a specific domain. For example, Objectron [1] and Co3D [8] contain objects from fewer than 50 categories, while Epic-Kitchens [2] contains only objects in the kitchen. Third, the objects in existing datasets are often annotated at the category level, so the object understanding models built on top of them are often unable to distinguish between two object instances from the same category. Instance-level object understanding capabilities, such as distinguishing different mugs, are important in delivering personalized experiences in the metaverse.
To address these gaps, we introduce EgoObjects, the first large-scale egocentric dataset dedicated to objects, pushing the frontier of object understanding for the metaverse. We built the dataset with several key features in mind, which we describe below.
In this blog, we'll introduce the EgoObjects pilot version, consisting of 9,273 videos (30+ hours) covering 368 object categories, with a total of 654K annotations. Portions of the dataset contain 14.4K unique object instances. Below, we present more details on the EgoObjects collection and annotation process and the dataset's characteristics. We'll also present benchmarking results on two tasks: the traditional category-level object detection and a novel instance-level object detection. In doing so, we hope to share insights on the unique research opportunities that EgoObjects brings and how it can accelerate object understanding research in egocentric vision.
We worked with third-party vendors who recruited participants from 25 countries wearing smart glasses or holding mobile phones with ultrawide lenses near their forehead to capture multiple videos of common household objects. We provided a list of ~400 object categories (for the main objects) as a seed set and asked participants to find these objects in their households.
EgoObjects contains four types of video capture variations: object distance, camera motion, background, and lighting. First, we consider three object distances based on object scale and frame scale. Object scale is the longest object dimension, and frame scale is the shorter video frame dimension. We define near object distance as an object-scale-to-frame-scale ratio greater than 30 percent, medium as a ratio between 20 percent and 30 percent, and far otherwise. Second, we prescribed three camera motions to capture different viewpoints of the same object. Horizontal refers to a camera moving from left to right or right to left, vertical indicates a camera moving upward or downward, and combined refers to a camera moving in both horizontal and vertical motions. Third, we captured videos with backgrounds of two complexities: simple and busy. A simple background has at least three but no more than five surrounding objects near the main object, whereas a busy background has at least five nearby objects. The placement of the objects is natural in both simple and busy backgrounds. Last, there are bright and dim lighting conditions. Bright lighting is when a light meter reads more than 250 lux, while dim lighting is when the light meter reads less than that.
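To make the distance rule concrete, here is a minimal Python sketch of the ratio-based classification; the helper name `object_distance` and its box/frame arguments are our own illustration, not part of the dataset tooling.

```python
# A minimal sketch of the object-distance rule described above; the function
# name and argument conventions are illustrative, not from the dataset tooling.

def object_distance(box_w: float, box_h: float, frame_w: int, frame_h: int) -> str:
    """Classify object distance from a 2D bounding box and the frame size."""
    object_scale = max(box_w, box_h)      # longest object dimension
    frame_scale = min(frame_w, frame_h)   # shorter video frame dimension
    ratio = object_scale / frame_scale
    if ratio > 0.30:
        return "near"                     # ratio greater than 30 percent
    elif ratio >= 0.20:
        return "medium"                   # ratio between 20 and 30 percent
    else:
        return "far"                      # everything else

# Example: a 600x400 box in a 1920x1080 frame -> 600 / 1080 ≈ 0.56 -> "near"
print(object_distance(600, 400, 1920, 1080))
```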
Each participant collected 10 videos for each main object, as shown in Table 2, and each video is 10 seconds long. Each video also carries rich metadata, including the participant identifier, main object category, video identifier, location, background description, capture time, and lighting.
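As an illustration, a metadata record for a single video might look like the hypothetical example below; the field names and values are assumptions for illustration only, not the actual EgoObjects schema.

```python
# Hypothetical metadata record for one captured video. Field names and values
# are illustrative only; the actual EgoObjects schema is not shown in this blog.
video_metadata = {
    "participant_id": "P0123",             # participant identifier
    "main_object_category": "coffee mug",  # category of the main object
    "video_id": "vid_000456",
    "location": "US",                      # capture location
    "background": "busy",                  # simple or busy
    "capture_time": "2022-03-14T18:25:00Z",
    "lighting": "dim",                     # bright (>250 lux) or dim
}
```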
As EgoObjects is collected continuously, its object taxonomy also grows dynamically. In this process, we carefully handle the resulting annotation challenges, such as partially overlapping visual concepts, parent-child relationships, and synonyms. LVIS [6] proposed the concept of federated datasets and a six-stage annotation pipeline to address the issues of large-vocabulary datasets. However, the LVIS pipeline operates on a static set of images from COCO [7], which means the output of each stage is also static, so sampling and iterative operations can be applied. EgoObjects, on the other hand, is dynamic: 1) the number of videos grows daily, and 2) the category set and the set of positive/negative samples per category change continuously. Naively adopting the LVIS pipeline does not work; for example, LVIS stage 1 would fail because it needs to go through the entire dataset several times.
Therefore, we designed a three-stage federated annotation pipeline specifically for our dynamic data collection. Figure 2 illustrates the annotation pipeline by showing the output of each stage, described below. This pipeline is applied to frames sampled from videos at one frame per second. In addition to the category labels, we also assign an instance identifier to each object, track the same object instance in each video, and identify each object instance across all videos in the dataset.
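As a side note, sampling frames at one frame per second can be done in a few lines with OpenCV; the sketch below is purely illustrative and not the pipeline's actual implementation.

```python
# Illustrative sketch of 1-fps frame sampling with OpenCV (not the actual
# annotation pipeline code).
import cv2

def sample_frames_1fps(video_path: str):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back to 30 if fps is unknown
    step = max(int(round(fps)), 1)            # keep one frame per second
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```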
This stage aims to identify the categories present in the image from a fixed vocabulary 𝒱 of 638 categories. For each image, annotators are asked to select at least five categories from 𝒱, including the category of the main object highlighted by the collecting participant as well as those of nearby objects.
Given the categories discovered in stage 1, the next step is to exhaustively annotate the objects of each category. In this stage, each image is reviewed by three annotators, who draw bounding boxes for all instances of every identified category. Each bounding box is assigned a label pair (c, i), where c is the category name and i is the instance identifier. Note that the instance identifier is consistent across all frames in the 10 videos of the main object. On each frame, we define a multi-review IoU metric that measures the IoU scores between boxes from each annotator, averaged across all the boxes. This metric is further averaged across all frames, and the annotation with the highest average IoU is chosen as the final bounding box annotation.
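The blog does not spell out the exact box-matching scheme behind the multi-review IoU, so the sketch below assumes each of an annotator's boxes is matched to its best-overlapping box from every other annotator; treat it as one possible reading rather than the actual annotation tooling.

```python
# One possible reading of the per-annotator multi-review IoU described above.
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def multi_review_iou(annotator_boxes, other_annotators_boxes):
    """Average, over one annotator's boxes, of the best IoU against each
    other annotator's boxes on the same frame."""
    scores = []
    for box in annotator_boxes:
        for other_boxes in other_annotators_boxes:
            if len(other_boxes) > 0:
                scores.append(max(iou(box, b) for b in other_boxes))
    return float(np.mean(scores)) if scores else 0.0
```

Per-frame scores would then be averaged across all frames, with the highest-scoring annotation kept, following the selection rule above.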
The final stage of the pipeline collects a set of negative categories for each image, where a negative category is defined as one for which no object of that category exists in the image. We do this by randomly sampling categories from the vocabulary 𝒱 and asking annotators to verify them. If any of the three annotators reports that at least one object of a candidate category exists in the image, we remove that category from the list.
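A compact sketch of that check might look as follows; the `annotator_flags_present` callable is a hypothetical stand-in for the human verification step.

```python
# Illustrative sketch of negative-category collection: sample candidate
# categories and keep only those no annotator flags as present in the image.
import random

def collect_negative_categories(vocabulary, annotator_flags_present, num_samples=10):
    """`annotator_flags_present(category)` is a hypothetical callable returning
    True if any of the three annotators reports an object of that category."""
    candidates = random.sample(vocabulary, num_samples)
    return [c for c in candidates if not annotator_flags_present(c)]
```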
At the end of this pipeline, we obtain, for each frame, a set of exhaustively annotated object bounding boxes covering at least five categories, plus a list of negative categories. On average, annotators spend ~120s on each frame: ~90s for stages 1 and 2, and ~30s for stage 3.
In the pilot data, we discovered a total of 14,400 instances from 368 categories. Among them were 1,292 main object instances from 206 categories and 13,108 secondary object instances from 353 categories. On average, each image was annotated with 5.64 instances from 4.84 categories, and each instance appeared in 44.8 images. For the main object, each instance appeared in 95.9 images, whereas each secondary instance appeared in 39.8 images, on average. Figure 3 shows the number of instances per category (left) and the number of annotations (bounding boxes) per category (right).
Our data collection process encourages main objects to be distributed throughout the entire image plane. This is verified in Figure 4, which shows the density of the instance center for both main object instances (left) and secondary object instances (right). Main objects have more diverse spatial distribution than secondary objects.
Compared with COCO and LVIS, EgoObjects has more large and medium-size objects, revealing that it pays more attention to closer objects, which the user is more likely to interact with. Figure 5 shows the relative size distribution of object bounding boxes.
We also computed statistics over the 9K+ collected videos; Figure 6 demonstrates their diversity in location, lighting, and background.
EgoObjects naturally supports the traditional category-level object detection task, and we benchmarked baseline models on it. Specifically, we split the dataset into a training set and a validation set, with 448K and 16K object annotations from 79K and 3K images, respectively. In total, there are 368 categories. We benchmarked Faster R-CNN [9] models using both ResNet-50 and ResNet-101 backbones.
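For readers who want a starting point, a category-level baseline can be put together with torchvision's Faster R-CNN implementation; the snippet below is a generic sketch, not the exact configuration behind the numbers reported here.

```python
# Minimal sketch of a Faster R-CNN category-level baseline using torchvision.
# This is a generic starting point, not the reported benchmark configuration.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

NUM_CLASSES = 368 + 1  # 368 EgoObjects categories + background

# Start from a COCO-pretrained detector and replace the classification head.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Training step: the model expects a list of images and a list of target dicts
# with "boxes" (Tensor[N, 4] in xyxy format) and "labels" (Tensor[N]).
model.train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([1])}]
loss_dict = model(images, targets)
loss = sum(loss_dict.values())
loss.backward()
```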
EgoObjects also enables a more fine-grained task: object detection at the instance level. We propose to address this task with continual learning. A baseline detector is designed based on Faster R-CNN [9] with two modes: object registration and object detection. In the object registration mode, new object instances are added to an object index, including both the raw image and extracted features; the model does not require any retraining. In the object detection mode, the detector should detect object instances already registered in the object index and predict their instance IDs.
To be more specific, in the object registration phase, the input is an image and a bounding box enclosing the object, whose features are extracted from the detector backbone using RoIAlign [10]. For each registered object, we store its instance ID and RoI features in the index. In the object detection phase, the input is a query image together with the RoI features of all registered objects. The features of the query image are extracted by the detector backbone, and the RoI features of each registered object are correlated with the query features, which produces a heat map for that object. From the heat map modulated by the query features, both a confidence score and a bounding box are predicted, representing the prediction for the registered object on the query image.
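To make the two modes concrete, here is a rough sketch of how the object index and the query-time correlation could be wired up; the class name, feature-map stride, and correlation-by-convolution step are our own simplifications, and the actual baseline's score and box heads are omitted.

```python
# Rough sketch of the register/detect interface described above. Names, shapes,
# and the stride assumption are illustrative; the actual baseline heads differ.
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align

class InstanceIndex:
    def __init__(self, backbone, roi_size=7, spatial_scale=1.0 / 16):
        self.backbone = backbone            # shared detector backbone
        self.roi_size = roi_size
        self.spatial_scale = spatial_scale  # assumed feature-map stride of 16
        self.index = {}                     # instance_id -> RoI features

    # Registration mode: crop the object's features with RoIAlign; no retraining.
    def register(self, instance_id, image, box_xyxy):
        feats = self.backbone(image[None])                          # [1, C, H, W]
        rois = torch.cat([torch.zeros(1, 1), box_xyxy[None]], 1)    # [1, 5], batch idx first
        roi_feat = roi_align(feats, rois, output_size=self.roi_size,
                             spatial_scale=self.spatial_scale)      # [1, C, k, k]
        self.index[instance_id] = roi_feat

    # Detection mode: correlate each registered RoI feature with the query
    # feature map to get a per-instance heat map (score/box heads omitted).
    def correlate(self, query_image):
        q = self.backbone(query_image[None])                        # [1, C, H, W]
        return {iid: F.conv2d(q, w, padding=self.roi_size // 2)     # each [1, 1, H, W]
                for iid, w in self.index.items()}
```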
We train the detector on a split of 9.5K object instances from 79K images. For evaluation, we registered 4K instances from 251 categories that were unseen during training, and tested on a preliminary mini query split of 3K images. We report several detection metrics in the following table for models with different backbones.
Compared with the results in Table 3, the baseline models perform worse on the instance-level detection task than on the category-level detection task. This is expected, because distinguishing between individual object instances is more challenging than recognizing the category label alone.
EgoObjects data collection and annotation will continue, further increasing the dataset's scale and diversity. We will benchmark more state-of-the-art category-level and instance-level detectors under more settings to understand their strengths and limitations. Additionally, we will open-source part of the dataset, including image data, annotations, and a dev kit, to help the wider research community tackle the challenges in egocentric object understanding.