
Diffusion has shown great promise for generating robot actions. Can it also act as a world model that generates the future conditioned on actions? In our work, led by Han Qi and Haocheng Yin in collaboration with Yilun Du, we show that a controllable action-conditioned video diffusion model can produce photorealistic...

Heng Yang

4,962 subscribers

38,390 views • 1 year ago • via X (Twitter)

Science, Technology & Education

9 comments

Abhinav Girdhar • 1 year ago

@hanqi359246 @hcy1n @du_yilun This is a huge step forward! Using diffusion models as world models for action-conditioned predictions could revolutionize robotics. Excited to see how this improves policy learning and control.

SecurityPal • 1 year ago

In this episode of the 'In Security' Podcast, coming to you from the Himalayas, @WilHarm3, Operating Partner and CISO at @craft_ventures, and Josh Mullis, Head of Information Security at @productiv_inc, share thoughts on the evolving role of a CISO. 🔗:

LongFang • 1 year ago

@hanqi359246 @hcy1n @du_yilun 😮

VictorGallagher • 1 year ago

@hanqi359246 @hcy1n @du_yilun When I see this I think 3D printer control.

T J • 1 year ago

@hanqi359246 @hcy1n @du_yilun Melt the glaciers

Rohan Sundar • 1 year ago

@hanqi359246 @hcy1n @du_yilun 😯

Jason Hall • 1 year ago

@hanqi359246 @hcy1n @du_yilun cool work!

Maxime Alvarez • 1 year ago

@hanqi359246 @hcy1n @du_yilun Seems like a bit wasteful (for compute) to plan in image space, could we adapt this with V-JEPA which gives us video prediction in a latent space? Or is there a benefit to images?

Heng Yang • 1 year ago

@hanqi359246 @hcy1n @du_yilun Great comment. Definitely prediction in latent space should be the way forward. Perhaps not just latent space, but more structured representations that are object-centric/semantic. Images may be just a showcase of possibility and first step.

Related videos

Glad that our work “Inference-Time Enhancement of Generative Robot Policies via Predictive World Modeling”, led by Han Qi, has been accepted to IEEE Robotics and Automation Letters! 🎉 We propose Generative Predictive Control (GPC): sample action proposals from a pretrained diffusion policy (“look back”), roll them out with a diffusion-based action-conditioned video world model (“look forward”), then rank or optimize the actions using either a learned reward model or VLM preferences. Conceptually, this is trajectory optimization / MPC with hybrid sampling + gradient optimization, interpreted through modern diffusion priors and video world models. Interestingly, we first posted the paper on arXiv in Feb 2025, when action-conditioned video world models for planning were still rare—now this direction is rapidly gaining traction. Still many open questions, e.g., • how to avoid local minima in planning • what representations work best for world models • how to balance physics priors vs. data-driven learning Paper:

Heng Yang

18,994 views • 3 months ago
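
The sample, roll out, and rank loop described in the GPC post above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's implementation: `sample_action_proposals`, `world_model_step`, and `reward` are hypothetical stand-ins for the pretrained diffusion policy, the action-conditioned video world model, and the learned reward model, and the state is a single number rather than video frames. Only the ranking mode is shown; the gradient-based optimization mentioned in the post is omitted.

```python
import random

def sample_action_proposals(state, num_proposals, horizon, rng):
    """Hypothetical stand-in for a pretrained diffusion policy ("look back")."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
            for _ in range(num_proposals)]

def world_model_step(state, action):
    """Hypothetical stand-in for the action-conditioned world model ("look forward")."""
    return state + action

def reward(state, goal):
    """Hypothetical stand-in for a learned reward model: closer to the goal is better."""
    return -abs(state - goal)

def rank_proposals(state, goal, num_proposals=64, horizon=5, seed=0):
    """Sample action sequences, roll each out in the model, return the best one."""
    rng = random.Random(seed)
    proposals = sample_action_proposals(state, num_proposals, horizon, rng)
    scored = []
    for actions in proposals:
        predicted = state
        for action in actions:
            predicted = world_model_step(predicted, action)
        scored.append((reward(predicted, goal), actions))
    best_score, best_actions = max(scored, key=lambda item: item[0])
    return best_actions, best_score
```

Calling `rank_proposals(0.0, 2.0)` returns the sampled action sequence whose imagined final state lands closest to the goal, along with its score.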

An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics and motion deltas between frames.

Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.

Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.

Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

11,575 views • 4 months ago

📢 Lyra: Generative 3D Scene Reconstruction via Video Diffusion Model Self-Distillation Got only one or a few images and wondering if recovering the 3D environment is a reconstruction or generation problem? Why not do it with a generative reconstruction model! We show that a camera-conditioned video diffusion model can be transformed into a generative reconstruction model that directly outputs a high-quality 3D Gaussian Splatting representation through self-distillation, without requiring real-world training data. Check out our results in the video (wait for dynamic scenes in the second half!) : Project Page: Code and Models: Paper:

Sherwin Bahmani

66,490 views • 9 months ago

Cosmos Policy turns a pretrained video diffusion model into a robot controller. Instead of redesigning the architecture, it injects robot state, actions, and values directly as latent frames inside the video model

Robots Digest 🤖

22,933 views • 5 months ago

Google presents Genie: Generative Interactive Environments. The paper introduces Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

AK

684,281 views • 2 years ago

What would a World Model look like if we start from a real embodied agent acting in the real world? It has to have: 1) A real, physically grounded and complex action space, not just abstract control signals. 2) Diverse, real-life scenarios and activities. Or in short: It has to be annoyingly complex, in both the action and vision space, to even get close to real life. We did an initial attempt: Whole-Body Conditioned Egocentric Video Prediction. In collaboration with @dans_t123, Amir Bar, Yann LeCun, Trevor Darrell, and Jitendra Malik. (For more details, check:) What we did is very simple: Predict Egocentric Video from human Actions (PEVA). Given the past video and a future action represented by relative 3D body pose, PEVA predicts how the world looks next, from the first-person view. By conditioning on kinematic pose trajectories, structured by the joint hierarchy of the body, it learns how physical actions shape perception.

Yutong Bai

176,940 views • 1 year ago

Working towards action-conditioned video diffusion models. For now this is just CS2 gameplay, and I parse the keypresses from the .dem game file. Next steps will be to work on a scalable data loader, then coding the model.

Arnie Ramesh

31,090 views • 4 months ago

These imagined futures can be action-conditioned to produce different outcomes. For example, the videos below are generated entirely by the neural network by simply using different prompts

Tesla AI

406,108 views • 3 years ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

The World Model as NEO's Cognitive Core

1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset.
- the 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment
- from a simple voice or text prompt, the world model generates a visualization of future actions
- a built-in inverse dynamics model then translates these into precise motor movements for NEO

The Humanoid Hub

68,453 views • 5 months ago
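
The pipeline in the 1X post above (imagine future frames, then recover actions with an inverse dynamics model) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `imagine_future` is a hypothetical stand-in for the world model that simply interpolates between a start and goal position, and `inverse_dynamics` treats the action as the displacement between consecutive "frames". The real system operates on video and motor commands.

```python
def imagine_future(start, goal, num_steps):
    """Hypothetical world model: linearly interpolate from start to goal."""
    return [tuple(s + (g - s) * k / num_steps for s, g in zip(start, goal))
            for k in range(num_steps + 1)]

def inverse_dynamics(frame, next_frame):
    """Hypothetical inverse dynamics model: the action is the displacement
    that moves `frame` to `next_frame`."""
    return tuple(n - f for f, n in zip(frame, next_frame))

def plan_actions(start, goal, num_steps=4):
    """Generate imagined frames, then infer one action per consecutive pair."""
    frames = imagine_future(start, goal, num_steps)
    return [inverse_dynamics(a, b) for a, b in zip(frames, frames[1:])]
```

Summing the actions returned by `plan_actions((0.0, 0.0), (4.0, 2.0))` recovers the total displacement from start to goal, which is the consistency property an inverse dynamics model is meant to provide.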

Long-horizon visual goals remain surprisingly hard for robot manipulation. We introduce Act2Goal, a goal-conditioned policy that uses a visual world model to reason about progress toward a goal, and practice it autonomously in the real world.

Jianlan Luo

95,100 views • 5 months ago

World models (action-conditioned predictive models of the environment) are an exciting area of research for robots that can be useful both for training and for test-time compute. But video-based world models waste a lot of predictive power on reconstructing pixels, which makes model and data requirements much higher and limits how far out into the future their predictions remain viable. Instead, what if we learned a purely semantic world model, one which predicts which properties will be true about the world after a sequence of actions, without reconstructing the whole images? Jacob Berg tells us more. Watch Episode #53 of RoboPapers now, with Michael Cho - Rbt/Acc and Chris Paxton!

RoboPapers

39,322 views • 6 months ago

This is THE moment of Physical AI! We are officially announcing Cosmos 3: Omnimodal World Models for Physical AI 🚀 - Cosmos 3 is an omnimodal world model: within a unified architecture, it can understand and generate language, images, video, audio, and actions. - It is not just a VLM, not just a video generator, not just an audio-visual generative model, and not just a physics simulator / world-action model. It can understand images and videos, generate images, videos, and audio, simulate future worlds, predict actions, and generate robot policies—enabling models to truly begin to “touch the world.” - Cosmos 3 is the #1 open-weight reasoner / T2I / I2V / robot policy across many benchmarks. Huge thanks to every teammate who fought side by side on this journey—from architecture, data, training, infra, serving, and evaluation to post-training. Every part of this project carries an incredible amount of hard work. This was my first time leading a project as Tech Lead, and I feel truly fortunate. The future of Physical AI needs models that can not only “see” and “describe” the world, but also “imagine,” “simulate,” and “act”—and eventually close the loop with the real world. I hope Cosmos 3 can become an important starting point for this direction, and I’m excited to push Physical AI into its next stage together with the open-source community. Welcome to the era of Physical AI. HuggingFace: Project Website: Code:

Max Zhaoshuo Li 李赵硕

1,077,093 views • 24 days ago

Today we're excited to launch Action: a Claude computer use launcher for macOS. Action is a macOS launcher that can take actions (click, type, and more) on your Mac using Claude’s computer use API. The interface is a floating window triggered by a keyboard shortcut, similar to Spotlight. This lets you see what the model outputs as it performs actions.

Lawrence Chen

21,622 views • 1 year ago

Attention: this might be the video that opens the door to the future. When a SOTA agent model connects to a robot dog with no remote control— MiniMax M2.1 × Vbot, super-powered in action. A model trained in the virtual world now controls a robot in the physical world. Hello, humans. Welcome to the future.

MiniMax (official)

151,672 views • 5 months ago

Generative models (diffusion/flow) are taking over robotics 🤖. But do we really need to model the full action distribution to control a robot? We suspected the success of Generative Control Policies (GCPs) might be "Much Ado About Noising." We rigorously tested the myths. 🧵👇

Chaoyi Pan

111,340 views • 6 months ago

If you have a policy that uses diffusion/flow (e.g. diffusion VLA), you can run RL where the actor chooses the noise, which is then denoised by the policy to produce an action. This method, which we call diffusion steering (DSRL), leads to a remarkably efficient RL method! 🧵👇

Sergey Levine

152,824 views • 1 year ago
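
The idea in the DSRL post above, acting in the noise space of a frozen policy rather than in action space, can be illustrated with a toy sketch. This is not the DSRL algorithm: a simple hill-climbing search is swapped in for the RL actor, `frozen_policy` is a hypothetical deterministic map from noise to action standing in for a diffusion/flow denoiser, and `env_reward` is an invented one-dimensional reward.

```python
import math
import random

def frozen_policy(noise):
    """Hypothetical frozen diffusion/flow policy: maps input noise to an action.
    Its parameters are never updated; only the noise fed to it changes."""
    return math.tanh(noise)  # actions stay in (-1, 1)

def env_reward(action, target=0.5):
    """Invented task reward: highest when the action equals `target`."""
    return -(action - target) ** 2

def steer_noise(num_iters=200, step=0.3, seed=0):
    """Hill-climb over the noise input (a stand-in for the RL actor)."""
    rng = random.Random(seed)
    noise = rng.gauss(0.0, 1.0)
    best = env_reward(frozen_policy(noise))
    for _ in range(num_iters):
        candidate = noise + rng.gauss(0.0, step)
        r = env_reward(frozen_policy(candidate))
        if r > best:  # keep the noise only if the resulting action is better
            noise, best = candidate, r
    return noise, frozen_policy(noise), best
```

Because the search only changes the noise, every action it produces still comes from the frozen policy, which is the property the post highlights.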

Westlake Robotics just dropped the General Action Expert (GAE), a general large model that can generate arbitrary actions in real-time with very low latency. It allows the robot to become your physical avatar, executing any action like a shadow. #WestlakeRobotics #GAE #Robotics

RoboHub🤖

86,012 views • 8 months ago

Yann LeCun just revealed why he left Meta, why LLMs are an extremely narrow field, and how the "world model" is the way to build a really meaningful agentic future. 🎯 Beautiful and simple in 1.5 minutes. "I can’t imagine we can build agentic systems without those systems having the ability to predict, in advance, what the consequences of their actions are going to be. The way we act in the world is that we can predict the consequences of our actions, and that’s what allows us to plan. So what is a world model? Given the state of an environment, a system you want to control at time t, and given an action or intervention you imagine taking, can you predict the state of the world (or the system) at time t + 1? If you can, that’s a world model. You don’t do this at a pixel level, if it’s video. You do this in an abstract representation space, and that’s a crucial key insight." --- From the "AI House Davos" YT channel (full link in comment)

Rohan Paul

383,162 views • 5 months ago
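
The definition quoted above can be written compactly. The symbols are illustrative notation, not taken from the talk: $s_t$ is the state at time $t$, $a_t$ the imagined action, $f_\theta$ a learned predictor, and $\mathrm{enc}$ an assumed encoder from observations $o_t$ to an abstract representation $z_t$.

```latex
\hat{s}_{t+1} = f_\theta(s_t, a_t)
\qquad \text{or, in representation space,} \qquad
z_t = \mathrm{enc}(o_t), \quad \hat{z}_{t+1} = f_\theta(z_t, a_t)
```

The second form reflects the point in the quote that prediction happens in an abstract representation space rather than at the pixel level.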

Big news from the TeleAI embodied intelligence team! They unveiled TextOp, a general cerebellar framework for real-time, text-driven humanoid control. TextOp lets humans control the robot via natural language, dynamically modifying instructions during runtime to instantly generate smooth, whole-body actions. This enables precise and highly versatile control. The framework uses a two-layer architecture: an action diffusion model (High-Level) and a general motion tracking policy (Low-Level). It’s a new paradigm that eliminates the need for pre-recorded scripts or manual programming.

RoboHub🤖

23,804 views • 7 months ago