Diffusion has shown great promise for generating robot **actions**. Can it also act as a **world model** that generates the future conditioned on actions? In our work, led by Han Qi and Haocheng Yin in collaboration with Yilun Du, we show that a **controllable** action-conditioned video diffusion model can produce photorealistic...

38,390 views • 1 year ago • via X (Twitter)

9 Comments

Abhinav Girdhar • 1 year ago

@hanqi359246 @hcy1n @du_yilun This is a huge step forward! Using diffusion models as world models for action-conditioned predictions could revolutionize robotics. Excited to see how this improves policy learning and control.

SecurityPal • 1 year ago

In this episode of the 'In Security' Podcast, coming to you from the Himalayas, @WilHarm3, Operating Partner and CISO at @craft_ventures, and Josh Mullis, Head of Information Security at @productiv_inc, share thoughts on the evolving role of a CISO. 🔗:

LongFang • 1 year ago

@hanqi359246 @hcy1n @du_yilun 😮

VictorGallagher • 1 year ago

@hanqi359246 @hcy1n @du_yilun When I see this I think 3D printer control.

T J • 1 year ago

@hanqi359246 @hcy1n @du_yilun Melt the glaciers

Rohan Sundar • 1 year ago

@hanqi359246 @hcy1n @du_yilun 😯

Jason Hall • 1 year ago

@hanqi359246 @hcy1n @du_yilun cool work!

Maxime Alvarez • 1 year ago

@hanqi359246 @hcy1n @du_yilun Planning in image space seems a bit wasteful in terms of compute. Could we adapt this with V-JEPA, which gives us video prediction in a latent space? Or is there a benefit to images?

Heng Yang • 1 year ago

@hanqi359246 @hcy1n @du_yilun Great comment. Prediction in latent space should definitely be the way forward. Perhaps not just latent space, but more structured representations that are object-centric or semantic. Images may be just a showcase of what is possible and a first step.
