There's a problem with 3D human pose & shape (HPS) estimation methods. You either get good 3D accuracy or good alignment with the image, but not both. Why? The current top methods use the wrong camera model. TokenHMR at #CVPR2024 analyzes the issue and presents a solution. (1/8)

Michael Black

100,061 subscribers

80,500 views • 2 years ago • via X (Twitter)

Education • Science and Technology • News and Politics • #CVPR2024

Comments: 11

Michael Black • 2 years ago

Current HPS methods use a simplified camera model that differs from the true camera. With the wrong camera, you have to distort the body pose or shape so that projected 3D features match the image. Estimating the true camera, however, is a challenging and unsolved problem (2/8)
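To make the camera point concrete, here is a minimal sketch with a plain pinhole projection. All numbers (focal lengths, depths, the hand offset) are hypothetical and chosen only for illustration; they are not taken from TokenHMR or HMR2.0. It shows that when a method assumes a much longer focal length than the real one, a hand reaching toward the camera lands on the wrong pixel unless the 3D pose is distorted.

```python
def project_x(x, z, focal):
    """Horizontal pinhole projection, in pixels from the image centre,
    of a point at lateral offset x and depth z (both in metres)."""
    return focal * x / z

# All numbers below are hypothetical, chosen only for illustration.
true_focal = 1000.0     # focal length of the real camera, in pixels
assumed_focal = 5000.0  # fixed, overly long focal length assumed by a method

hip_z = 3.0                  # hip depth in metres, on the optical axis (x = 0)
hand_x, hand_dz = 0.6, -1.0  # hand is 0.6 m to the side and 1 m closer than the hip

# Where the hand really appears in the image.
true_u = project_x(hand_x, hip_z + hand_dz, true_focal)

# With the assumed focal length, the body is pushed back so that its image
# scale at the hip is unchanged: depth scales with focal length.
wrong_hip_z = hip_z * assumed_focal / true_focal
wrong_u = project_x(hand_x, wrong_hip_z + hand_dz, assumed_focal)

# Lateral hand offset the wrong camera would need to hit the true pixel.
required_x = true_u * (wrong_hip_z + hand_dz) / assumed_focal

print(f"true pixel: {true_u:.1f}, wrong-camera pixel: {wrong_u:.1f}")
print(f"hand offset needed to match the image: {required_x:.2f} m instead of {hand_x} m")
```

In this toy setup the correct 3D hand projects about 86 pixels away from where it appears in the image, and matching the image would require stretching the hand offset from 0.6 m to 0.84 m.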

Michael Black • 2 years ago

Using BEDLAM, a synthetic dataset with perfect ground-truth (GT), we quantitatively evaluate the problem. With the HMR2.0 camera, we evaluate the 2D projection error of 3D bodies computed by HMR2.0 and GT bodies. With the wrong camera, HMR2.0 gets lower 2D error than GT. (3/8)

Michael Black • 2 years ago

On the flip side, low 2D reprojection error results in worse 3D accuracy. For a given 2D image alignment error, there are effectively an infinite number of 3D poses that can produce this, and they can be really bad. Training a method with a 2D loss and wrong camera is bad. (4/8)
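The ambiguity can be sketched in a few lines: every 3D point along the same camera ray projects to the same pixel, so the 2D error says nothing about how far off the 3D estimate is. The joint position, focal length, and scale factors below are hypothetical values used only for illustration.

```python
import math

def project(point, focal=1000.0):
    """Pinhole projection of a 3D point (x, y, z) in camera coordinates."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

# Hypothetical ground-truth joint, in metres.
gt_joint = (0.3, -0.2, 3.0)

# Candidate estimates that all lie on the ray through the ground-truth joint.
scales = (0.8, 1.0, 1.5)
candidates = [tuple(s * c for c in gt_joint) for s in scales]

results = []
for cand in candidates:
    err_2d = math.dist(project(cand), project(gt_joint))  # pixels
    err_3d = math.dist(cand, gt_joint)                    # metres
    results.append((err_2d, err_3d))
    print(f"2D error: {err_2d:.6f} px, 3D error: {err_3d:.3f} m")
```

All three candidates have essentially zero 2D error, yet the last one is more than a metre away from the true joint in 3D.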

Michael Black • 2 years ago

3D pseudo-GT that's estimated from 2D with the wrong camera has the same issue. To address this, we introduce two solutions. First, with 2D data, the loss should not try to fit it too well. Our new TALS loss penalizes large 2D errors while down-weighting small ones. (5/8)
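As a rough illustration of the idea of keeping full weight on large 2D errors while down-weighting small ones, here is a minimal sketch. It is not the authors' exact TALS formulation: the function name, the hard threshold, and the `threshold` and `small_weight` values are all assumptions made for this example.

```python
def thresholded_2d_loss(errors, threshold=0.05, small_weight=0.1):
    """Mean per-joint 2D loss with reduced weight on small errors.

    errors: per-joint 2D errors (e.g. in normalized image units).
    threshold, small_weight: hypothetical values, not from the paper.
    """
    total = 0.0
    for e in errors:
        # Large errors keep full weight; small ones are scaled down so the
        # model is not pushed to fit the 2D evidence too tightly.
        weight = 1.0 if e > threshold else small_weight
        total += weight * e
    return total / len(errors)

# One joint already well aligned, one joint far off.
loss = thresholded_2d_loss([0.01, 0.2])
print(loss)
```

With these settings the well-aligned joint contributes only a tenth of its error, so the gradient signal is dominated by the badly aligned joint.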

Michael Black • 2 years ago

With TALS, common pose priors have too much influence. Thus we use a VQ-VAE to convert continuous poses to a discrete token representation; trained on AMASS & MOYO. This pre-trained tokenizer provides a vocabulary of valid poses. Pose regression becomes classification. (6/8)
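The quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbour lookup in a codebook. This toy version uses a hand-written 2D codebook purely for illustration; in a real tokenizer the codebook, encoder, and decoder are learned, and none of the names or values below come from the TokenHMR code.

```python
import math

# Hypothetical codebook: each entry stands for one valid "pose token".
codebook = [
    (0.0, 0.0),
    (1.0, 0.0),
    (0.0, 1.0),
    (1.0, 1.0),
]

def quantize(latent):
    """Return the token (index) of the codebook vector nearest to latent."""
    return min(range(len(codebook)), key=lambda i: math.dist(latent, codebook[i]))

def dequantize(token):
    """Map a token back to its codebook vector."""
    return codebook[token]

token = quantize((0.9, 0.2))   # continuous latent -> discrete token
print(token, dequantize(token))
```

Because the output space is now a finite set of indices, a network can predict a distribution over tokens rather than regress continuous values, which is the sense in which regression becomes classification.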

Michael Black • 2 years ago

TokenHMR estimates 3D HPS using a discrete tokenized pose representation. Our TALS loss mitigates some of the bias caused by simplified camera models and biased pseudo-GT. This enables training on 2D data for robustness without losing 3D accuracy. (7/8)

Michael Black • 2 years ago

Kudos to the authors: @saidwivedi, @yusun14567741, @PriyankaP1201, @YaoFeng1995 and @Michael_J_Black from @MPI_IS, @meshcapade and @ETH_en. arXiv: Code and models are available at (8/8)

Tope Ibrahim • 2 years ago

Dear Prof. Black, I will be attending the CVPR conference in Seattle between 18-21 of June as a first-timer. Over the years, you have remained one of the researchers I often draw inspiration from, and I will be very honoured to meet you in person.

Michael Black • 2 years ago

I look forward to meeting you! Come find me at one of our posters.

Mathieu Tuli • 2 years ago

Great work, excited to come chat at CVPR. We'll be there presenting FlowFace as well (face tracking from 2D video); would love to have you come by.

Michael Black • 2 years ago

Thanks for sharing this! I like the UV-flow idea. It combines two of my favorite things: 3D shape estimation and optical flow :) Fun fact: my very first paper on human faces used optical flow for expression recognition.
