Introducing HO-Cap: A Capture System and Dataset for 3D Reconstruction and Pose Tracking of Hand-Object Interaction. We built a multi-camera system and a semi-automatic method for annotating the shape and pose of hands and objects. Project page:

The capture system consists of 8 Intel RealSense D455 cameras and 1 Microsoft Azure Kinect positioned above a table. All the cameras are calibrated. Users wear a Microsoft HoloLens AR headset during data collection.
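To illustrate why calibration matters in a rig like this, here is a minimal sketch of mapping a 3D point between camera frames and a shared world frame using extrinsics. The camera names, poses, and helper functions are hypothetical and are not taken from the HO-Cap toolbox.

```python
import math


def rot_z(theta):
    """3x3 rotation about the z-axis by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def to_world(R, t, p_cam):
    """Apply a camera-to-world extrinsic: p_world = R @ p_cam + t."""
    return tuple(sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3))


def to_camera(R, t, p_world):
    """Invert the extrinsic: p_cam = R^T @ (p_world - t)."""
    d = [p_world[i] - t[i] for i in range(3)]
    return tuple(sum(R[j][i] * d[j] for j in range(3)) for i in range(3))


# Hypothetical camera-to-world extrinsics (made-up values, not the real rig).
extrinsics = {
    "cam_a": (rot_z(0.0), (0.5, 0.0, 1.0)),
    "cam_b": (rot_z(math.pi / 2), (0.0, 0.5, 1.0)),
}

# One point on the table, expressed in the shared world frame.
p_world = (0.2, 0.1, 0.0)

# Each camera observes the same point with different coordinates...
observations = {name: to_camera(R, t, p_world) for name, (R, t) in extrinsics.items()}

# ...and the calibrated extrinsics bring every observation back to one frame.
fused = {name: to_world(*extrinsics[name], obs) for name, obs in observations.items()}
```

Once all views agree in one frame, depth data from the different cameras can be merged into a single point cloud.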

First, we use BundleSDF to reconstruct the textured meshes of objects. To prepare the input data, we manually move and rotate an object in front of the Azure Kinect camera, ensuring exhaustive coverage of the surfaces for high-fidelity reconstruction.

64 objects are reconstructed in the HO-Cap dataset.

In our semi-automatic annotation pipeline, we use FoundationPose for initial object pose estimation and MediaPipe for hand pose estimation, followed by joint optimization of hand and object poses based on signed distance fields (SDFs).
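As a rough illustration of what SDF-based pose refinement means, the toy sketch below nudges an initial pose estimate until observed surface points sit on the zero level set of the object's SDF. It is not the HO-Cap implementation: the object is a hypothetical sphere with an analytic SDF, only translation is optimized, and plain gradient descent stands in for whatever optimizer the real pipeline uses. All names and values are assumptions.

```python
import math

RADIUS = 0.05  # hypothetical object: a sphere of radius 5 cm


def sphere_points(center, n=50):
    """Sample n roughly even points on the sphere surface (Fibonacci lattice)."""
    golden = math.pi * (3.0 - math.sqrt(5.0))
    pts = []
    for i in range(n):
        z = 1.0 - 2.0 * (i + 0.5) / n
        r_xy = math.sqrt(1.0 - z * z)
        phi = i * golden
        pts.append((
            center[0] + RADIUS * r_xy * math.cos(phi),
            center[1] + RADIUS * r_xy * math.sin(phi),
            center[2] + RADIUS * z,
        ))
    return pts


def mean_sq_sdf(points, t):
    """Mean squared SDF value of the points, expressed in the object frame at t."""
    total = 0.0
    for p in points:
        dist = math.sqrt(sum((p[i] - t[i]) ** 2 for i in range(3)))
        total += (dist - RADIUS) ** 2
    return total / len(points)


def refine_translation(points, t_init, lr=0.5, iters=200):
    """Gradient descent on mean_sq_sdf with respect to the translation t."""
    t = list(t_init)
    for _ in range(iters):
        grad = [0.0, 0.0, 0.0]
        for p in points:
            q = [p[i] - t[i] for i in range(3)]
            norm = math.sqrt(sum(c * c for c in q)) or 1e-12
            sdf = norm - RADIUS
            for i in range(3):
                # d/dt of sdf^2, using d(norm)/dt = -q / norm
                grad[i] += 2.0 * sdf * (-q[i] / norm)
        t = [t[i] - lr * grad[i] / len(points) for i in range(3)]
    return t


true_center = (0.30, -0.10, 0.05)      # made-up ground truth
initial_guess = (0.32, -0.11, 0.06)    # stands in for a coarse pose estimate
observed = sphere_points(true_center)  # noise-free "depth" points on the surface
estimate = refine_translation(observed, initial_guess)
```

A real pipeline would also optimize rotation, use the SDF of the reconstructed mesh, and add terms that couple the hand and the object, for example to penalize interpenetration.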

Finally, the HO-Cap dataset provides segmentation masks and poses of hands and objects in the 64 collected videos, including first-person view videos from the HoloLens.

This annotation pipeline has limitations: 1) BundleSDF cannot reconstruct some objects very well, 2) MediaPipe occasionally fails to detect hand joints, 3) Object pose estimation may fail when small objects are held within the hand. Failed videos are excluded from the dataset.

This project is led by Jikai Wang from IRVL at UT Dallas, in collaboration with Yu-Wei Chao @yu_wei_chao and Bowen Wen @bowenwen_me at NVIDIA. A toolbox to use the dataset is available at

In addition to using this dataset to study vision problems, another motivation is to use the data as human demonstrations for dexterous robot manipulation. The manipulation actions in HO-Cap are very difficult for current robots, which makes them a challenge for future robotic systems.
