There's a problem with 3D human pose & shape (HPS) estimation methods. You either get good 3D accuracy or good alignment with the image, but not both. Why? The current top methods use the wrong camera model. TokenHMR at #CVPR2024 analyzes the issue and presents a solution. (1/8)

Michael Black

97,513 subscribers

80,462 views • 2 years ago • via X (Twitter)

Education, Science & Technology, News & Politics, #CVPR2024

11 comments

Michael Black • 2 years ago

Current HPS methods use a simplified camera model that differs from the true camera. With the wrong camera, you have to distort the body pose or shape so that projected 3D features match the image. Estimating the true camera, however, is a challenging and unsolved problem (2/8)
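To make the camera mismatch concrete, here is a minimal sketch in plain Python. It is not code from TokenHMR. It assumes a simple pinhole model and hypothetical numbers: a true focal length of 1000 px, an assumed focal length of 5000 px, and two joints at different depths. It suggests why a wrong camera forces distortion: pushing the body back rigidly makes one joint line up under the assumed camera, but the other joint is still off by roughly 38 px, so the pose or shape would have to change to close the gap.

```python
def project(point, focal):
    """Pinhole projection of a camera-space point (x, y, z) to pixel offsets."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

# Hypothetical values chosen for illustration only.
TRUE_FOCAL = 1000.0     # focal length of the camera that took the image (px)
ASSUMED_FOCAL = 5000.0  # fixed focal length assumed by the estimator (px)

# Two joints at the same lateral offset but different depths (metres).
joints = [(0.5, 0.0, 2.0), (0.5, 0.0, 2.5)]
observed = [project(j, TRUE_FOCAL) for j in joints]

# Translate the whole body along z so the first joint matches the image
# when projected with the assumed focal length.
dz = ASSUMED_FOCAL * joints[0][0] / observed[0][0] - joints[0][2]
shifted = [(x, y, z + dz) for x, y, z in joints]
reprojected = [project(j, ASSUMED_FOCAL) for j in shifted]

# The second joint cannot be matched by the same rigid shift.
residual_px = abs(reprojected[1][0] - observed[1][0])
print(f"depth shift: {dz:.1f} m, leftover 2D error: {residual_px:.1f} px")
```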

Michael Black • 2 years ago

Using BEDLAM, a synthetic dataset with perfect ground-truth (GT), we quantitatively evaluate the problem. With the HMR2.0 camera, we evaluate the 2D projection error of 3D bodies computed by HMR2.0 and of GT bodies. With the wrong camera, HMR2.0 gets lower 2D error than GT. (3/8)

Michael Black • 2 years ago

On the flip side, low 2D reprojection error results in worse 3D accuracy. For a given 2D image alignment error, there are effectively an infinite number of 3D poses that can produce this, and they can be really bad. Training a method with a 2D loss and wrong camera is bad. (4/8)
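The ambiguity can be illustrated for a single joint with a short sketch. It assumes a pinhole camera with a hypothetical focal length of 1000 px and a made-up reference joint. Every point along the same viewing ray lands on the same pixel, so the 2D error stays at zero while the 3D error grows without bound.

```python
import math

def project(point, focal=1000.0):
    """Pinhole projection of a camera-space point (x, y, z) to pixel offsets."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

# Hypothetical ground-truth joint position in camera coordinates (metres).
reference = (0.2, -0.1, 2.0)

# Candidate 3D positions: the reference scaled along its viewing ray.
scales = (1.0, 1.5, 3.0)
candidates = [tuple(s * c for c in reference) for s in scales]

pixels = [project(p) for p in candidates]
errors_2d = [math.dist(px, pixels[0]) for px in pixels]
errors_3d = [math.dist(p, reference) for p in candidates]

for s, e2, e3 in zip(scales, errors_2d, errors_3d):
    print(f"scale {s}: 2D error {e2:.6f} px, 3D error {e3:.3f} m")
```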

Michael Black • 2 years ago

3D pseudo-GT that's estimated from 2D with the wrong camera has the same issue. To address this, we introduce two solutions. First, with 2D data, the loss should not try to fit it too well. Our new TALS loss penalizes large 2D errors while down-weighting small ones. (5/8)
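The thread does not give the formula for TALS, so the following is only a sketch of the idea as described here (small errors are down-weighted, large errors are still penalized), not the paper's formulation. The function name, the threshold, the down-weighting factor, and the choice to keep the loss continuous at the threshold are all assumptions made for illustration.

```python
def thresholded_loss(error, threshold=0.05, small_weight=0.1):
    """Down-weight errors below `threshold`; penalize larger errors at full slope.

    Hypothetical illustration: `threshold` and `small_weight` are made-up values.
    """
    if error < 0:
        raise ValueError("error must be non-negative")
    if error <= threshold:
        # Small misalignments are mostly tolerated.
        return small_weight * error
    # Above the threshold the loss grows with slope 1, and the offset keeps
    # the function continuous at the threshold.
    return small_weight * threshold + (error - threshold)

losses = [thresholded_loss(e) for e in (0.01, 0.05, 0.5)]
print(losses)
```

With these assumed settings, an error of 0.01 contributes a tenth of what a plain L1 loss would, while an error of 0.5 is penalized almost in full.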

Michael Black • 2 years ago

With TALS, common pose priors have too much influence. Thus we use a VQ-VAE, trained on AMASS & MOYO, to convert continuous poses to a discrete token representation. This pre-trained tokenizer provides a vocabulary of valid poses. Pose regression becomes classification. (6/8)
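The quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbor lookup. This is a toy example under stated assumptions: the 2D codebook and latent vectors are hand-written, whereas a real tokenizer would learn the codebook together with an encoder and decoder. It shows how a continuous vector becomes a token index, which is what lets a regressor predict a class instead of a continuous pose.

```python
def quantize(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

# Hypothetical 4-entry codebook; each entry stands in for a valid pose code.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical continuous latents produced by an encoder.
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = [quantize(z, codebook) for z in latents]  # discrete token indices
decoded = [codebook[t] for t in tokens]            # snapped back to valid codes
print(tokens, decoded)
```

Because every decoded vector is drawn from the codebook, the output is restricted to the vocabulary, which is the sense in which the tokenizer acts as a prior.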

Michael Black • 2 years ago

TokenHMR estimates 3D HPS using a discrete tokenized pose representation. Our TALS loss mitigates some of the bias caused by simplified camera models and biased pseudo-GT. This enables training on 2D data for robustness without losing 3D accuracy. (7/8)

Michael Black • 2 years ago

Kudos to the authors: @saidwivedi, @yusun14567741, @PriyankaP1201, @YaoFeng1995 and @Michael_J_Black from @MPI_IS, @meshcapade and @ETH_en arXiv: Code and models are available at (8/8)

Tope Ibrahim • 2 years ago

Dear Prof. Black, I will be attending the CVPR conference in Seattle from 18 to 21 June as a first-timer. Over the years, you have remained one of the researchers I often draw inspiration from, and I would be very honoured to meet you in person.

Michael Black • 2 years ago

I look forward to meeting you! Come find me at one of our posters.

Mathieu Tuli • 2 years ago

Great work, excited to come chat at CVPR. We'll be there presenting FlowFace as well (face tracking from 2D video); would love to have you come by.

Michael Black • 2 years ago

Thanks for sharing this! I like the UV-flow idea. It combines two of my favorite things: 3D shape estimation and optical flow :) Fun fact: my very first paper on human faces used optical flow for expression recognition.

Related videos

WHAM defines the new state of the art in 3D human pose estimation from video. By a large margin. It’s fast, accurate, and it computes human pose in world coordinates. It’s also the first video-based method to be more accurate than single-image methods. 1/8

Michael Black

118,411 views • 2 years ago

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]. The problem is that we're at that stage (typical for AI) where image-to-3d models are not good enough yet but are fun to play with, and we know they'll be good enough in 1-2 years. With [ Make 3d model ] you can already turn any Photo AI pic into a 3d model; it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful, and have now built: let people make a 3d model, change the view of it with the 3d viewer, then press [ o ] to grab a frame of the 3d. That image you can then [ Remix ] (img2img) so it becomes a real photo again, which in turn you can turn into a video with [ Make video ]. That essentially gives you fully freeform camera position control to take photos with. One thing I need to fix is the background/skybox: I need to take the original photo, remove the person, and keep just the background for the 3d model viewer. In this case it should be white, but it's a start!

@levelsio

119,210 views • 11 months ago

Phidias A Generative Model for Creating 3D Content from Text, Image, and 3D Conditions with Reference-Augmented Diffusion discuss: In 3D modeling, designers often use an existing 3D model as a reference to create new ones. This practice has inspired the development of Phidias, a novel generative model that uses diffusion for reference-augmented 3D generation. Given an image, our method leverages a retrieved or user-provided 3D reference model to guide the generation process, thereby enhancing the generation quality, generalization ability, and controllability. Our model integrates three key components: 1) meta-ControlNet that dynamically modulates the conditioning strength, 2) dynamic reference routing that mitigates misalignment between the input image and 3D reference, and 3) self-reference augmentations that enable self-supervised training with a progressive curriculum. Collectively, these designs result in a clear improvement over existing methods. Phidias establishes a unified framework for 3D generation using text, image, and 3D conditions with versatile applications.

AK

25,120 views • 1 year ago

3D editing is hard: you need to ground an image + instruction and generate a faithful 3D shape in one forward pass -- no test-time optimization. So, we steer pretrained image-to-3D representations to do text-guided 3D edits; no massive 3D edit-pair dataset needed. Key trap: the “no-edit” solution is a nasty local minimum. We fix it with preference optimization, pushing the model to actually edit. Steer3D is the second work that adapts alignment ideas from LLMs to the 3D modality. SAM 3D also used DPO to improve its 3D generations.

Georgia Gkioxari

115,844 views • 6 months ago

Code and data are now online for CameraHMR, our state-of-the-art parametric 3D human pose and shape (HPS) estimation method that will appear at #3DV2025. There are 4 key contributions that make it so accurate and robust: 1. To get accurate 3D shape and pose as well as good alignment to image features, you need to know the focal length of the camera. To solve this, we train HumanFOV to compute the field of view. 2. We introduce CameraHMR, which integrates HumanFOV into HMR2.0 to exploit the estimated focal length. 3. To get accurate pseudo ground truth (pGT) training data, we compute the focal length for images in the 4DHumans dataset and modify SMPLify to take this into account. 4. But SMPLify only uses sparse 2D keypoints, which do not capture body shape. So we train a dense surface keypoint detector, DenseKP, on BEDLAM and run it on 4DHumans, resulting in improved body shape. The resulting method is CamSMPLify. We iterate training CameraHMR and running CamSMPLify on the training set initialized with CameraHMR. This results in much improved pGT for 4DHumans and a SOTA single-image HMR method.
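Field of view and focal length are two views of the same quantity under a pinhole model. The sketch below uses the standard relation f = (size / 2) / tan(FOV / 2). The function names and the 1920 px image width are hypothetical, and the post does not say which convention (horizontal, vertical, or diagonal FOV) HumanFOV uses.

```python
import math

def focal_from_fov(fov_degrees, image_size_px):
    """Focal length in pixels for a pinhole camera with the given field of view.

    `image_size_px` is the image extent along the axis the FOV refers to.
    """
    return (image_size_px / 2.0) / math.tan(math.radians(fov_degrees) / 2.0)

def fov_from_focal(focal_px, image_size_px):
    """Inverse of `focal_from_fov`: field of view in degrees."""
    return math.degrees(2.0 * math.atan((image_size_px / 2.0) / focal_px))

# Hypothetical example: a 90 degree horizontal FOV on a 1920 px wide image.
focal = focal_from_fov(90.0, 1920)
print(round(focal, 3), round(fov_from_focal(focal, 1920), 3))
```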

Michael Black

21,647 views • 1 year ago

The BEDLAM2.0 dataset (B2) is here, just in time to train your 3D human pose and shape estimation methods for CVPR. B2 goes beyond BEDLAM (B1) to include widely varied and natural camera motions and fields of view, more diverse body shapes, strand-based hair, more garments, shoes, more body motions, and more 3D scenes. Compared with B1, training on B2 produces more accurate 3D human pose, resulting in SOTA accuracy, particularly for estimates in world coordinates. B2 lets you jointly train camera motion and human motion regressors, and we also provide depth maps. Check out data, code, dataset statistics, and much more. BEDLAM2.0 will appear in the 2025 NeurIPS Datasets and Benchmarks Track. Joint work with Joachim Tesch, Giorgio Becherini, Prerana Achar, Anastasios Yiannakidis, Muhammed Kocabas, Priyanka Patel.

Michael Black

27,407 views • 7 months ago

3D Gaussian Splatting is great, but can it work without pre-computed camera poses? Introducing: COLMAP-Free 3D Gaussian Splatting. Our recent work shows that not only can it, but 3D Gaussians also make camera pose estimation easy (compared to NeRF), along with reconstruction. 👇🧵

Xiaolong Wang

76,747 views • 2 years ago

Supervised learning has held 3D Vision back for too long. Meet RayZer, a self-supervised 3D model trained with zero 3D labels: ❌ No supervision of camera & geometry ✅ Just RGB images. And the wild part? RayZer outperforms supervised methods (as 3D labels from COLMAP are noisy) 🌐 Project: (1/4)

Hanwen Jiang

69,527 views • 1 year ago

Bring your stories to life with a 3D camera. Start with a single frame and turn it into a 3D scene you can move through, shot by shot. Control the camera. Set the pace. Shape the story.

Moonvalley

18,713 views • 10 months ago

Introducing SAM 3D, the newest addition to the SAM collection, bringing common sense 3D understanding of everyday images. SAM 3D includes two models: 🛋️ SAM 3D Objects for object and scene reconstruction 🧑‍🤝‍🧑 SAM 3D Body for human pose and shape estimation Both models achieve state-of-the-art performance transforming static 2D images into vivid, accurate reconstructions. 🔗 Learn more:

AI at Meta

857,203 views • 6 months ago

Love these old meets new vfx workflows 1. extract a still of the object you want to augment 2. use image-to-3d to make a 3d model of the object 3. use that 3d geometry for classical object tracking Then you can go wild with complete control

Bilawal Sidhu

33,824 views • 1 year ago

GPT Image 2 + Codex: or how to make Codex not suck at UI. Step 1: Generate a UI image (native in Codex) Step 2: Get Codex to implement the UI based on it Step 3: Get Codex to iterate until it aligns with the image as much as possible Codex is bad at initial UI, but very good at implementing a reference design, so this is your way out - iterate with the image model first and then Codex will do a good job.

Peter Gostev

66,675 views • 1 month ago

Drivable 3D Gaussian Avatars paper page: present Drivable 3D Gaussian Avatars (D3GA), the first 3D controllable model for human bodies rendered with Gaussian splats. Current photorealistic drivable avatars require either accurate 3D registrations during training, dense input images during testing, or both. The ones based on neural radiance fields also tend to be prohibitively slow for telepresence applications. This work uses the recently presented 3D Gaussian Splatting (3DGS) technique to render realistic humans at real-time framerates, using dense calibrated multi-view videos as input. To deform those primitives, we depart from the commonly used point deformation method of linear blend skinning (LBS) and use a classic volumetric deformation method: cage deformations. Given their smaller size, we drive these deformations with joint angles and keypoints, which are more suitable for communication applications. Our experiments on nine subjects with varied body shapes, clothes, and motions obtain higher-quality results than state-of-the-art methods when using the same training and test data.

AK

327,040 views • 2 years ago

🌟Transform an image into a 3D model in just 5 easy steps! 1️⃣Visit the Hunyuan 3D website and log in: 2️⃣Navigate to the "3D Creation" page 3️⃣Choose the "Image to 3D" feature 🎨 4️⃣Upload your image and click "Generate Immediately" 5️⃣Wait a moment , and voilà—your stunning 3D model is ready! ⏳✨ Watch our tutorial video to get started! 😸🎥

Hunyuan

101,468 views • 1 year ago

Image to 3D, now in ANY pose! 🤸‍♂️ Simply upload your image + a pose reference photo. Meshy generates the model to match it perfectly. No complex rigging needed. Create 3D characters your way.

MeshyAI

103,213 views • 4 months ago

(1/N) Will this be the BERT/GPT moment for 3D vision？ Finally, unsupervised pre-training for 3D works. Led by Qitao Zhao , we present E-RayZer — a fully self-supervised 3D reconstruction model that: 🔥Matches or surpasses supervised methods like VGGT 👀Learns transferable 3D representations, outperforming CroCo, VideoMAE, and DINO 📈Scales with more unlabeled data A new recipe for scalable 3D foundation models.

Hanwen Jiang

57,886 views • 6 months ago

Multi-modal #LLMs understand a lot about humans. But do they understand our 3D pose? We train #PoseGPT to estimate, generate, and reason about 3D human pose (#SMPL) in images and text. This is the first true foundation model for understanding 3D humans.

Michael Black

81,365 views • 2 years ago

Finally! #PHORHUM -- our 3D human reconstruction model from a single image -- is available to the research community 🎉 PHORHUM is joint work with Mihai Zanfir & Cristian Sminchisescu. How to get access: 👇

Thiemo Alldieck

12,764 views • 3 years ago