Shuo Yang

@ShuoYangAIR • 2,504 subscribers

CTO & Co-founder @mondorobotics / Ex Tesla | CMU PhD | Ex DJI

Videos

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by Nvidia's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.

31,596 views • 4 months ago

DiT4DiT is now open source! As the first humanoid-deployable Video-Action Model built on a world model, DiT4DiT continues to surprise us. In our paper last month, we showed its strong data efficiency. Now, with only slight modifications, it enables real-time whole-body autonomous pick-and-place. Paper: Code: Website:

14,308 views • 3 months ago
