🚀 Introducing InterDyn, our newly accepted CVPR work that explores controllable synthesis of interactive dynamics! Building upon powerful video diffusion models, InterDyn infers future motion and interactions directly from an input image and a dynamic control signal (e.g., a moving hand mask). Check out how we push the...

Haven (Haiwen) Feng

1,139 subscribers

44,898 views • 1 year ago • via X (Twitter)

10 comments

Haven (Haiwen) Feng • 1 year ago

Dynamic Control & Beyond: Unlike prior methods that rely on explicit simulation or only static state transitions, InterDyn builds a dynamic control branch on top of Stable Video Diffusion. We then fine-tune it to generate complex interactions (e.g., hand-object manipulations) and realistic multi-object collisions without heavy simulation computations. #StableDiffusion #StabilityAI 🧵2/6

Haven (Haiwen) Feng • 1 year ago

Intuitive Physics: At its core, InterDyn showcases the diffusion model’s “knowledge” of real-world physics and causal effects. By simply providing a moving object mask, the system implicitly models collisions, force propagation, and object dynamics—no 3D reconstruction or separate physics engine needed. 🧵3/6
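To make the "moving object mask" control signal concrete, here is a minimal sketch of what such an input could look like: one binary mask per frame, with a square region sliding to the right. The function name `moving_mask_sequence`, the square shape, and all sizes are assumptions for illustration only. The thread does not specify InterDyn's actual mask format or resolution.

```python
def moving_mask_sequence(height, width, num_frames, box=4, step=2):
    """Build num_frames binary masks (lists of rows) with a square sliding right.

    Hypothetical helper for illustration; not taken from the InterDyn code.
    """
    frames = []
    top = (height - box) // 2  # keep the square vertically centered
    for t in range(num_frames):
        left = min(t * step, width - box)  # clamp so the square stays in frame
        mask = [
            [1 if top <= y < top + box and left <= x < left + box else 0
             for x in range(width)]
            for y in range(height)
        ]
        frames.append(mask)
    return frames


masks = moving_mask_sequence(height=16, width=16, num_frames=5)
for t, mask in enumerate(masks):
    covered = [x for x in range(16) if any(row[x] for row in mask)]
    print(f"frame {t}: mask covers columns {covered[0]}..{covered[-1]}")
```

In a real pipeline a sequence like this would typically be stacked into a tensor and passed to the model alongside the input image. How InterDyn encodes and consumes it is not described in the thread.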

Haven (Haiwen) Feng • 1 year ago

Superior Performance: We evaluate InterDyn on both synthetic (CLEVRER) and real-world datasets (Something-Something-v2), achieving up to 37.5% improvement on LPIPS and 77% on FVD over prior work. Whether it’s multi-object collisions or hand-object manipulations, InterDyn produces diverse and physically plausible videos. 🧵4/6
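LPIPS and FVD are both lower-is-better metrics, so a percentage "improvement" is usually read as the relative reduction against a baseline. The sketch below shows that arithmetic. The scores are hypothetical values chosen only to reproduce the percentages quoted above, not numbers from the paper, and the thread does not say exactly how the authors computed their figures.

```python
def relative_improvement(baseline, ours):
    """Percentage reduction of a lower-is-better metric relative to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - ours) / baseline


# Hypothetical scores, chosen only to illustrate the arithmetic.
lpips_gain = relative_improvement(baseline=0.40, ours=0.25)
fvd_gain = relative_improvement(baseline=200.0, ours=46.0)
print(f"LPIPS improvement: {lpips_gain:.1f}%")
print(f"FVD improvement: {fvd_gain:.1f}%")
```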

Haven (Haiwen) Feng • 1 year ago

Toward Interactive Video Generation: This new perspective merges intuitive physics with large-scale generative models, opening the door to controllable dynamics synthesis in complex scenes. We believe InterDyn lays the groundwork for future explorations in interactive video generation. Stay tuned for more! 🧵5/6

Haven (Haiwen) Feng • 1 year ago

This work was co-led by me and our talented Master's intern @rick_akker25502 (he's applying for a PhD now, hire him!), together with our amazing advisors @Michael_J_Black, @dimtzionas and @vfabrevaya. More details & demos coming soon! See you in Nashville! #CVPR2025 #AI #ResearchPapers 🧵6/6

Adam • 1 year ago

Great work! I had a similar idea for hand-object interaction with video generation but with 3D conditioning

Robert Scoble • 1 year ago

Wow great work!

Kangfu Mei • 1 year ago

Very nice and creative work!

Daniel Sungho Jung • 1 year ago

Interesting work! Were there any challenges during the research?

Erika S • 1 year ago

InterDyn’s approach to controllable synthesis of interactive dynamics is fascinating. I’m excited to see how it advances intuitive physics with video generative models—truly pushing boundaries in AI and computer vision!

Related videos

Google presents VLOGGER Multimodal Diffusion for Embodied Avatar Synthesis We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of

AK

66,375 views • 2 years ago

(1/2) LightIt: Illumination Modeling and Control for Diffusion Models! #CVPR2024 We facilitate lighting control for novel image generation from text prompts. We can also edit lighting for a given input image. Video: Project:

Matthias Niessner

19,835 views • 2 years ago

Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis

AK

28,101 views • 11 months ago

Generative Novel View Synthesis with 3D-Aware Diffusion Models abs: project page:

AK

304,708 views • 3 years ago

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048 abs: project page:

AK

718,682 views • 3 years ago

I'm excited to share our new work Diffusion as Shader (DaS), a versatile controllable video generation method for various tasks: object manipulation, camera control, mesh-to-video, and motion transfer. Project page: Github:

Yuan Liu

33,363 views • 1 year ago

VSTAR Generative Temporal Nursing for Longer Dynamic Video Synthesis Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-sourced T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to

AK

36,982 views • 2 years ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,417 views • 8 months ago

Glad that our work “Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling”, led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉 We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy (“look back”), roll them out with a diffusion-based action-conditioned video world model (“look forward”), then rank or optimize the actions using either a learned reward model or VLM preferences. Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models. Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare—now this direction is rapidly gaining traction. Still many open questions, e.g., • how to avoid local minima in planning • what representations work best for world models • how to balance physics priors vs. data-driven learning Paper:

Heng Yang

18,968 views • 3 months ago

Scaling up GANs for Text-to-Image Synthesis present our 1B-parameter GigaGAN, achieving lower FID than Stable Diffusion v1.5, DALL·E 2, and Parti-750M. It generates 512px outputs at 0.13s, orders of magnitude faster than diffusion and autoregressive models, and inherits the disentangled, continuous, and controllable latent space of GANs abs: project page:

AK

278,115 views • 3 years ago

Dreamix: Video Diffusion Models are General Video Editors abs: project page: present diffusion-based method that is able to perform text-based motion and appearance editing of general videos

AK

398,132 views • 3 years ago

👀 Pixel perfect 💎✨ 🖼️ Edify Image from #NVIDIAResearch is a family of diffusion models that supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360° HDR panorama generation, and finetuning for image customization. 🧵 1/2

NVIDIA AI Developer

14,747 views • 1 year ago

Today we're introducing Gen-4, our new series of state-of-the-art AI models for media generation and world consistency. Gen-4 is a significant step forward for fidelity, dynamic motion and controllability in generative media. Gen-4 Image-to-Video is rolling out today to all paid plans and Enterprise customers. 1/8

Runway

736,120 views • 1 year ago

I finally released my new video on YouTube about Diffusion Models / Score-Based Generative Models. Literally planned this for a year and put so much work in. I think this approach to diffusion models is so intuitive and highly recommend giving that a go! Video is 38min long, so you will need some time to watch that haha.

dome | Outlier

54,540 views • 1 year ago

Punchline: World models == VQA (about the future)! Planning with world models can be powerful for robotics/control. But most world models are video generators trained to predict everything, including irrelevant pixels and distractions. We ask - what if a world model only predicted the semantic information necessary for decision-making? Introducing Semantic World Models (SWM). Given an observation and an action sequence, SWMs cast modeling as answering textual questions about the future outcome resulting from the actions. Recasting world modeling as a VQA problem lets us directly leverage the pretrained knowledge and machinery of VLMs for generalizable modeling. We had a lot of fun thinking about how this work helps connect these two seemingly very different fields of study - VLMs and world models! 🧵(1/6) Paper: Fun demo:

Abhishek Gupta

61,150 views • 7 months ago

We present Neural Riemannian Motion Fields (NRMF) as a promising paradigm for generative modeling of articulated motion. Unlike various diffusion based models, NRMF is useful both in generation and in deployment as a prior across tasks: #CVPR #AI #CVML👇

Tolga Birdal

16,624 views • 8 months ago

🚀Excited to introduce GEN3C #CVPR2025, a generative video model with an explicit 3D cache for precise camera control. 🎥It applies to multiple use cases, including single-view and sparse-view NVS🖼️ and challenging settings like monocular dynamic NVS and driving simulation🚗. Project page:

Xuanchi Ren

59,931 views • 1 year ago

Netflix just dropped Go-with-the-Flow Motion-Controllable Video Diffusion Models Using Real-Time Warped Noise

AK

95,542 views • 1 year ago

(1/n) Time to unify your favorite visual generative models, VLMs, and simulators for controllable visual generation—Introducing a Product of Experts (PoE) framework for inference-time knowledge composition from heterogeneous models.

Yunzhi Zhang

48,871 views • 1 year ago

1/ 🚀 Introducing AIDO.StructureDiffusion: A generative model for structural protein design—enabling high-quality, controllable generation of monomers, complexes, and antibodies. 🧵

GenBio AI

918,205 views • 11 months ago