
Shuo Yang
@ShuoYangAIR • 2,413 subscribers
CTO & Co-founder @ Mondo Robotics/ Ex Tesla | CMU PhD | Ex DJI
Videos

We’re excited to share DiT4DiT, an end-to-end Video-Action Model for robot learning that unifies a video Diffusion Transformer and an action Diffusion Transformer in a single cascaded framework. By leveraging the rich spatiotemporal and physical dynamics learned through video generation, rather than static image-text priors, DiT4DiT achieves state-of-the-art results on LIBERO (98.6%) and RoboCasa GR1 (50.8%) with far less training data, delivering over 10× better sample efficiency and up to 7× faster convergence. Real-world deployment on a humanoid robot further shows robust generalization. We believe this is a step toward making video generation a powerful backbone for robot policy learning. This work builds upon the brilliant foundations laid by NVIDIA's GR00T and Cosmos. Project: Paper: Code: Coming soon. In the meantime, you can ask your coding agent to reproduce the method based on GR00T/Cosmos.
Shuo Yang • 31,438 views • 2 months ago

DiT4DiT is now open source! As the first humanoid-deployable Video-Action Model built on a world model, DiT4DiT continues to surprise us. In our paper last month, we showed its strong data efficiency. Now, with only slight modifications, it enables real-time whole-body autonomous pick-and-place. Paper: Code: Website:
Shuo Yang • 13,771 views • 1 month ago