Michael Black

@Michael_J_Black • 100,134 subscribers

VP Digital Human Research, Epic Games. Emeritus Director, Max Planck Institute for Intelligent Systems (@MPI_IS). Opinions are my own.

Shorts

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning, and it promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia, Room S423/S424, Level 4, at 13:50 on 15 Dec.

22,144 views

Given a monocular video as input, #HOLD reconstructs 3D hand and object surfaces for every frame without assuming a known object template. Our key insight is that interacting hands and objects provide complementary cues about each other's shape and pose. 1/4

21,594 views

Videos

WHAM defines the new state of the art in 3D human pose estimation from video. By a large margin. It’s fast, accurate, and it computes human pose in world coordinates. It’s also the first video-based method to be more accurate than single-image methods. 1/8

118,463 views • 2 years ago

There's a problem with 3D human pose & shape (HPS) estimation methods. You either get good 3D accuracy or good alignment with the image, but not both. Why? The current top methods use the wrong camera model. TokenHMR at #CVPR2024 analyzes the issue and presents a solution. (1/8)

80,500 views • 2 years ago

Multi-modal #LLMs understand a lot about humans. But do they understand our 3D pose? We train #PoseGPT to estimate, generate, and reason about 3D human pose (#SMPL) in images and text. This is the first true foundation model for understanding 3D humans.

81,406 views • 2 years ago

The BEDLAM2.0 dataset (B2) is here, just in time to train your 3D human pose and shape estimation methods for CVPR. B2 goes beyond BEDLAM (B1) to include widely varied and natural camera motions and fields of view, more diverse body shapes, strand-based hair, more garments, shoes, more body motions, and more 3D scenes. Compared with B1, training on B2 produces more accurate 3D human pose, resulting in SOTA accuracy, particularly for estimates in world coordinates. B2 lets you jointly train camera motion and human motion regressors, and we also provide depth maps. Check out data, code, dataset statistics, and much more. BEDLAM2.0 will appear in the 2025 NeurIPS Datasets and Benchmarks Track. Joint work with Joachim Tesch, Giorgio Becherini, Prerana Achar, Anastasios Yiannakidis, Muhammed Kocabas, Priyanka Patel.

27,523 views • 8 months ago

Upgrade your expressive 3D human avatars from #SMPL-X to #SUPR, our latest and greatest body model. SUPR is trained from 1.2M 3D scans, is more expressive, and includes feet with articulation and compression. Code by @NeelayShah8, video by Anastasios Yiannakidis.

60,553 views • 3 years ago

Train your avatars to interact with 3D scenes. We use adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Today at #SIGGRAPH2023.

46,237 views • 2 years ago

Code and data are now online for CameraHMR, our state-of-the-art parametric 3D human pose and shape (HPS) estimation method that will appear at #3DV2025. There are 4 key contributions that make it so accurate and robust:
1. To get accurate 3D shape and pose as well as good alignment to image features, you need to know the focal length of the camera. To solve this, we train HumanFOV to compute the field of view.
2. We introduce CameraHMR, which integrates HumanFOV into HMR2.0 to exploit the estimated focal length.
3. To get accurate pseudo ground truth (pGT) training data, we compute the focal length for images in the 4DHumans dataset and modify SMPLify to take this into account.
4. But SMPLify only uses sparse 2D keypoints, which do not capture body shape. So we train a dense surface keypoint detector, DenseKP, on BEDLAM and run it on 4DHumans, resulting in improved body shape. The resulting method is CamSMPLify.
We iterate between training CameraHMR and running CamSMPLify on the training set, initialized with CameraHMR. This results in much improved pGT for 4DHumans and a SOTA single-image HMR method.

21,696 views • 1 year ago
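
The CameraHMR post above links field of view to focal length. As a minimal sketch of that relationship only, assuming an ideal pinhole camera with the principal point at the image center, the conversion could look like the following. The function names are hypothetical and are not taken from the CameraHMR or HumanFOV code, and the field-of-view value is supplied by hand here rather than predicted by a network.

```python
import math


def focal_length_from_fov(fov_degrees: float, image_size_px: float) -> float:
    """Focal length in pixels for an ideal pinhole camera (assumption).

    fov_degrees is the field of view measured along the same image
    dimension (width or height) as image_size_px.
    """
    if not 0.0 < fov_degrees < 180.0:
        raise ValueError("field of view must be strictly between 0 and 180 degrees")
    half_angle = math.radians(fov_degrees) / 2.0
    # Half the image extent subtends half the field of view at distance f.
    return (image_size_px / 2.0) / math.tan(half_angle)


def fov_from_focal_length(focal_length_px: float, image_size_px: float) -> float:
    """Inverse of focal_length_from_fov; returns the field of view in degrees."""
    if focal_length_px <= 0.0:
        raise ValueError("focal length must be positive")
    return math.degrees(2.0 * math.atan((image_size_px / 2.0) / focal_length_px))


if __name__ == "__main__":
    # Hypothetical example: a 1000-pixel-wide image with a 90-degree horizontal field of view.
    f = focal_length_from_fov(90.0, 1000.0)
    print(f"focal length: {f:.1f} px")
    print(f"recovered field of view: {fov_from_focal_length(f, 1000.0):.1f} degrees")
```

Under this model a wider field of view gives a shorter focal length for the same image size, which is why a wrong focal-length assumption trades off 3D accuracy against image alignment.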

You've been asking for it... Soyong Shin has put the trained #WHAM model online. Please give it a try for capturing 3D human motion and let us know how it goes.

14,279 views • 2 years ago
