Google presents CAT4D Create Anything in 4D with Multi-View Video Diffusion Models

AK

415,531 subscribers

61,949 views • 1 year ago • via X (Twitter)

Science & Technology

9 Comments

AK • 1 year ago

discuss:

Rundi Wu • 1 year ago

Thanks for sharing our work! Project page: arXiv:

BensenHsu • 1 year ago

The paper presents a method called CAT4D (Create Anything in 4D) that can generate high-quality dynamic 3D scenes from a single input monocular video. The key idea is to leverage a multi-view video diffusion model trained on a diverse combination of datasets to enable novel view synthesis at any specified camera poses and timestamps. The authors evaluate their method on various tasks, including novel view synthesis, sparse-view static 3D reconstruction in the presence of scene motion, and 4D reconstruction from monocular videos. They show that their method can generate high-quality dynamic 3D scenes and outperforms existing state-of-the-art models that depend on multiple priors and external sources of information. full paper:

HistoricTechOmar Samir • 1 year ago

CAT4D? More like create anything in 4D and amaze me!

Daveheardt • 1 year ago

4D? Like 4 dimensions? If so - this is not it, this is 3D.

plugbrain • 1 year ago

Any chance of a code release?

Zero Vertex • 1 year ago

I wish my cat could bake like that. jk I don't have a cat

Fleeber • 1 year ago

oooo

RinGo_3.0 • 1 year ago

👀

Related Videos

Google presents LightLab Controlling Light Sources in Images with Diffusion Models

AK

122,981 views • 1 year ago

Diffuman4D 4D Consistent Human View Synthesis from Sparse-View Videos with Spatio-Temporal Diffusion Models

AK

12,518 views • 11 months ago

We’ve upgraded Stable Video Diffusion 4D to Stable Video 4D 2.0 (SV4D 2.0), improving the quality of 4D outputs generated from a single object-centric video. While 3D provides a static view of an object’s shape and size, 4D extends this by including time, showing how the object moves. This multi-view video diffusion model generates a 4D output in three steps: 1️⃣ Starts with an input video of a moving person or object 2️⃣ Generates novel views of the subject from unseen angles 3️⃣ Constructs a single dynamic 4D video output with spatial and temporal consistency You can learn more here: (1/4)

Stability AI

35,974 views • 1 year ago

InsertAnywhere Bridging 4D Scene Geometry and Diffusion Models for Realistic Video Object Insertion

AK

26,123 views • 5 months ago

Nvidia presents Articulated Kinematics Distillation from Video Diffusion Models

AK

39,189 views • 1 year ago

🌟 Create anything in 3D! 🌟 Introducing CAT3D: a new method that generates high-fidelity 3D scenes from any number of real or generated images in one minute, powered by multi-view diffusion models. w/ lovely coauthors Aleksander Holynski, Ben Poole and an amazing team!

Ruiqi Gao

152,867 views • 2 years ago

Nvidia just announced Align Your Gaussians Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models

AK

131,297 views • 2 years ago

Generative Novel View Synthesis with 3D-Aware Diffusion Models abs: project page:

AK

304,708 views • 3 years ago

Google presents VLOGGER Multimodal Diffusion for Embodied Avatar Synthesis We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of

AK

66,375 views • 2 years ago

MVDream: Multi-view Diffusion for 3D Generation paper page: We propose MVDream, a multi-view diffusion model that is able to generate geometrically consistent multi-view images from a given text prompt. By leveraging image diffusion models pre-trained on large-scale web datasets and a multi-view dataset rendered from 3D assets, the resulting multi-view diffusion model can achieve both the generalizability of 2D diffusion and the consistency of 3D data. Such a model can thus be applied as a multi-view prior for 3D generation via Score Distillation Sampling, where it greatly improves the stability of existing 2D-lifting methods by solving the 3D consistency problem. Finally, we show that the multi-view diffusion model can also be fine-tuned under a few-shot setting for personalized 3D generation, i.e. DreamBooth3D application, where the consistency can be maintained after learning the subject identity.

AK

294,442 views • 2 years ago

This is amazing! You can now create high-quality 3D Scenes from a single image using Multi-Instance Diffusion Models (MIDI) 🔥

Gradio

41,770 views • 1 year ago

Most multi-view reconstruction models need full supervision. We show they can self-improve without any ground truth labels. Introducing SelfEvo: Self-Improving 4D Perception via Self-Distillation. Up to +36.5% in video depth, +20.1% in camera estimation, zero annotation.

Qianqian Wang

24,309 views • 2 months ago

Meta presents Adaptive Caching for Faster Video Generation with Diffusion Transformers

AK

53,119 views • 1 year ago

V3D Video Diffusion Models are Effective 3D Generators Automatic 3D generation has recently attracted widespread attention. Recent methods have greatly accelerated the generation speed, but usually produce less-detailed objects due to limited model capacity or 3D data. Motivated by recent advancements in video diffusion models, we introduce V3D, which leverages the world simulation capacity of pre-trained video diffusion models to facilitate 3D generation. To fully unleash the potential of video diffusion to perceive the 3D world, we further introduce a geometrical consistency prior and extend the video diffusion model to a multi-view consistent 3D generator. Benefiting from this, the state-of-the-art video diffusion model could be fine-tuned to generate 360-degree orbit frames surrounding an object given a single image. With our tailored reconstruction pipelines, we can generate high-quality meshes or 3D Gaussians within 3 minutes. Furthermore, our method can be extended to scene-level novel view synthesis, achieving precise control over the camera path with sparse input views. Extensive experiments demonstrate the superior performance of the proposed approach, especially in terms of generation quality and multi-view consistency.

AK

31,997 views • 2 years ago

Robot Learning needs 4D world models! We introduce TesserAct, a 4D embodied world model that can simulate how agents interact with the 3D world over time! We achieve this by simply extending a pre-trained 2D video generation model to jointly predict RGB, depth, and surface normals. It enables: 1️⃣ Much better policy learning in the wild 2️⃣ Temporal + spatial coherence in 4D dynamic prediction 3️⃣ Novel view synthesis for embodied scenes Code: Paper Link: Project page:

Chuang Gan

43,265 views • 1 year ago

Nvidia presents Align Your Steps Optimizing Sampling Schedules in Diffusion Models Diffusion models (DMs) have established themselves as the state-of-the-art generative modeling approach in the visual domain and beyond. A crucial drawback of DMs is their slow sampling speed,

AK

32,888 views • 2 years ago

Meta presents: Pippo: High-Resolution Multi-View Humans from a Single Image Generates 1K resolution, multi-view, studio-quality images from a single photo in one forward pass

Aran Komatsuzaki

32,503 views • 1 year ago