Yunzhu Li

@YunzhuLiYZ • 9,742 subscribers

Co-Founder at SceniX, Assistant Professor @Columbia @ColumbiaCompSci, Postdoc from @Stanford @StanfordSVL, PhD from @MIT_CSAIL. #Robotics #Vision #Learning

Shorts

You can actually interact with the world simulator directly in the browser. 🤖 Here is a quick screen recording (8x speed) of me playing with it: real-time action-conditioned video prediction across rigid objects, deformable objects, rope, and object piles. Try it yourself (no install required): Huge kudos to my student Yixuan Wang for making the interactive demo happen!

41,513 views

Excited to share a few presentations, demos, and workshop talks from our group and collaborators at #ICRA2026! We will present recent work on real-to-sim-to-real robot policy evaluation, model-based planning with learned dynamics, and multi-modal manipulation. We will also have a joint live demo between SceniX and Analog Devices, Inc. on real-to-sim-to-real cable manipulation at the ICRA exhibition. This is a small teaser of what we have been building, with more to come soon! If you are at ICRA, please stop by the sessions or the demo booth. Happy to chat about robot learning, simulation, world models, and sim-to-real!

10,855 views

📢 Our lab has been exploring 3D world models for years, and we’re thrilled to share **PhysTwin**: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang 👉
PhysTwin combines **Gaussian splatting** with **inverse dynamics optimization** based on simple **spring-mass** systems.
⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models).
🔑 A few key takeaways:
1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability.
2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion).
3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨
4. The resulting digital twin supports a wide range of downstream applications, especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities.
🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.

25,279 views
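
For readers unfamiliar with the spring-mass systems mentioned in the post above, here is a minimal sketch of one simulation step for point masses connected by linear springs. It is an illustration only and not PhysTwin's implementation: the function name `spring_mass_step`, its parameters, and the choice of a semi-implicit Euler integrator are all assumptions made for this example.

```python
import numpy as np

def spring_mass_step(pos, vel, springs, rest_len, stiffness, mass, dt, damping=0.0):
    """Advance point masses joined by linear springs by one time step.

    Hypothetical helper for illustration (not from the PhysTwin codebase).
    pos, vel:   (N, 3) arrays of positions and velocities
    springs:    list of (i, j) index pairs
    rest_len:   rest length of each spring
    stiffness:  stiffness of each spring (the kind of parameter an
                inverse optimization could fit to observations)
    mass:       scalar mass shared by all points (an assumption)
    """
    forces = np.zeros_like(pos)
    for (i, j), length0, k in zip(springs, rest_len, stiffness):
        delta = pos[j] - pos[i]
        length = np.linalg.norm(delta)
        if length < 1e-12:
            continue  # coincident points: spring direction is undefined, skip
        # Hooke's law: a stretched spring pulls i toward j and j toward i.
        f = k * (length - length0) * delta / length
        forces[i] += f
        forces[j] -= f
    forces -= damping * vel  # simple velocity damping
    # Semi-implicit Euler: update velocity first, then position.
    new_vel = vel + dt * forces / mass
    new_pos = pos + dt * new_vel
    return new_pos, new_vel
```

Calling this function in a loop, with the positions of some points driven by the robot or hand, yields an action-conditioned rollout of the object state; fitting `stiffness` so that rollouts match observed motion is the general idea behind an inverse optimization over such a model.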

I want to call out one of our most important references: SIMPLER, by Xuanlin Li (Simon), Kyle Hsu, Jiayuan Gu, Jiajun Wu, Hao Su, Quan Vuong, Ted Xiao, and colleagues, which laid the foundation for using simulation for policy evaluation through a systematic study of appearance and dynamics alignment and metrics for measuring sim–real correlation. (It took many nights of grinding by Kaifeng Zhang and Shuo Sha for that correlation to appear, and once it did, extending to new tasks worked like a charm!)

11,870 views

🚀 Excited to share our #ICLR2025 work on planning with neural dynamics models! While our lab has developed diverse neural dynamics models for manipulating rigid, deformable, and granular objects, having the model alone doesn’t solve the problem—planning with it remains a challenge. 💡 Enter BaB-ND, led by Keyi and Jiangwei! We propose a scalable, GPU-accelerated branch-and-bound algorithm, inspired by neural network verification, to enable effective planning for diverse objects modeled with neural dynamics. 🔗 Project page (open-source + detailed docs!): 🎥 Watch the video to see T being pushed around obstacles, and check out Keyi’s thread for more details!

10,561 views

Videos

📢 Announcing one of the most exciting works from us this year on **scalable robot policy evaluation through real-to-sim transfer**, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is **orders of magnitude slower** than the development of language agents. We propose a new framework where simulation performance **strongly correlates** with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing **state-of-the-art correlation** between simulation and reality for deformable object manipulation. It provides a **scalable and reproducible evaluation engine** for robot learning. 🌐

39,960 views • 8 months ago
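
To make the "r > 0.9" claim in the post above concrete, here is a minimal sketch of how a sim-real correlation could be computed, assuming r denotes the Pearson correlation coefficient between per-policy scores measured in simulation and in the real world. The function name `pearson_r` and its inputs are hypothetical and are not taken from the authors' evaluation code.

```python
import numpy as np

def pearson_r(sim_scores, real_scores):
    """Pearson correlation between paired simulation and real-world scores.

    Hypothetical helper for illustration. Each index is one policy (or
    checkpoint); the two sequences hold its score in sim and in reality.
    """
    sim = np.asarray(sim_scores, dtype=float)
    real = np.asarray(real_scores, dtype=float)
    if sim.shape != real.shape or sim.ndim != 1 or sim.size < 2:
        raise ValueError("need two 1-D sequences of equal length >= 2")
    sim_c = sim - sim.mean()
    real_c = real - real.mean()
    denom = np.sqrt((sim_c ** 2).sum() * (real_c ** 2).sum())
    if denom == 0.0:
        raise ValueError("correlation is undefined when one input is constant")
    return float((sim_c * real_c).sum() / denom)
```

A value near 1 means that policies scoring higher in simulation also tend to score higher on the real robot, which is the property the post asks about. For example, `pearson_r(sim_success, real_success)` would take two equal-length lists of success rates, one entry per evaluated policy.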

📢 Our lab has been exploring 3D world models for years, and we’re thrilled to share **PhysTwin**: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang 👉
PhysTwin combines **Gaussian splatting** with **inverse dynamics optimization** based on simple **spring-mass** systems.
⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models).
🔑 A few key takeaways:
1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability.
2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion).
3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨
4. The resulting digital twin supports a wide range of downstream applications, especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities.
🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.

25,279 views • 1 year ago

Is VideoGen starting to become good enough for robotic manipulation? 🤖 Check out our recent work, RIGVid (Robots Imitating Generated Videos), where we use AI-generated videos as intermediate representations and 6-DoF motion retargeting to guide robots in diverse manipulation tasks: pouring, wiping, mixing, and more. 🔗
Key takeaways:
- VideoGen starts to become good enough for robotics.
- As the field progresses, we are expecting much better results in the coming years.
- Depending on whether video prediction models take actions or not (VideoGen vs Action-Conditioned Video Prediction), there are different ways to use them.
- Controllability & steerability are still issues.
In the paper, we explore:
- How do different video generation models compare for robotic imitation?
- Can generated videos replace real videos for imitation?
- What causes failure of imitation given high-quality videos?
- How does imitating from video compare with other representations (e.g., keypoint constraints like ReKep)?
🎥 Watch the video for (1) AI-generated inputs, (2) robot executions, and (3) the 3D intermediate representation bridging the embodiment gap.

16,540 views • 1 year ago

I was really impressed by the UMI gripper (Cheng Chi et al.), but a key limitation is that **force-related data wasn’t captured**: humans feel haptic feedback through the mechanical springs, but the robot couldn’t leverage that info, limiting the data’s value for fine-grained manipulation tasks.
Led by my amazing students Yolanda Zhu and Binghao Huang, we designed a **portable visuo-tactile gripper** by integrating our dense, flexible tactile arrays with the UMI gripper to enable large-scale in-the-wild data collection. 🔗
We demonstrate **cross-modal representation learning** and **downstream policy learning** on tasks requiring in-hand state estimation (e.g., test tube reorientation) and fine-grained force sensing (e.g., pipette fluid transfer).
Key takeaways:
- Our flexible tactile arrays store the rich haptic information humans perceive as dense tactile signals.
- Portability and robustness are key for in-the-wild data collection; our portable gripper is compact, lightweight, and durable.
- Touch provides precise, robust measurements of in-hand object pose, invariant to lighting and viewpoint.
- Cross-modal pretraining on large-scale in-the-wild data significantly improves policy robustness and sample efficiency (as shown many times before, and verified again here!).
Also check out our previous investigations of dense, flexible tactile grids for understanding human-robot-environment interactions:
- Dense tactile glove (Nature ’19):
- 3D-ViTac (CoRL ’24):

13,188 views • 1 year ago
