
Yunzhu Li
@YunzhuLiYZ • 9,003 subscribers
Assistant Professor of Computer Science @Columbia @ColumbiaCompSci, Postdoc from @Stanford @StanfordSVL, PhD from @MIT_CSAIL. #Robotics #Vision #Learning
Shorts
Videos

📢 Announcing one of the most exciting works from us this year on **scalable robot policy evaluation through real-to-sim transfer**, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is **orders of magnitude slower** than the development of language agents. We propose a new framework where simulation performance **strongly correlates** with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing **state-of-the-art correlation** between simulation and reality for deformable object manipulation. It provides a **scalable and reproducible evaluation engine** for robot learning. 🌐
Yunzhu Li39,797 görüntüleme • 7 ay önce

📢 Our lab has been exploring 3D world models for years — and we’re thrilled to share **PhysTwin**: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang 👉 PhysTwin combines **Gaussian splatting** with **inverse dynamics optimization** based on simple **spring-mass** systems. ⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models). 🔑 A few key takeaways: 1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability. 2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion). 3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨ 4. The resulting digital twin supports a wide range of downstream applications—especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities. 🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.
Yunzhu Li25,279 görüntüleme • 1 yıl önce

Is VideoGen starting to become good enough for robotic manipulation? 🤖 Check out our recent work, RIGVid — Robots Imitating Generated Videos — where we use AI-generated videos as intermediate representations and 6-DoF motion retargeting to guide robots in diverse manipulation tasks: pouring, wiping, mixing, and more. 🔗 Key takeaways: - VideoGen starts to become good enough for robotics - As the field progresses, we are expecting much better results in the coming years - Depending on whether video prediction models take actions or not (VideoGen vs Action-Conditioned Video Prediction), there are different ways to use them. - Controllability & steerability are still issues In the paper, we explore: – How do different video generation models compare for robotic imitation? – Can generated videos replace real videos for imitation? – What causes failure of imitation given high-quality videos? – How does imitating from video compare with other representations (e.g., keypoint constraints like ReKep)? 🎥 Watch the video for (1) AI-generated inputs, (2) robot executions, and (3) the 3D intermediate representation bridging the embodiment gap.
Yunzhu Li16,529 görüntüleme • 11 ay önce

I was really impressed by the UMI gripper (Cheng Chi et al.), but a key limitation is that **force-related data wasn’t captured**: humans feel haptic feedback through the mechanical springs, but the robot couldn’t leverage that info, limiting the data’s value for fine-grained manipulation tasks. Led by my amazing students Yolanda Zhu and Binghao Huang, we designed a **portable visuo-tactile gripper** by integrating our dense, flexible tactile arrays with the UMI gripper to enable large-scale in-the-wild data collection. 🔗 We demonstrate **cross-modal representation learning** and **downstream policy learning** on tasks requiring in-hand state estimation (e.g., test tube reorientation) and fine-grained force sensing (e.g., pipette fluid transfer). Key takeaways: - Our flexible tactile arrays store the rich haptic information humans perceive as dense tactile signals. - Portability and robustness are key for in-the-wild data collection; our portable gripper is compact, lightweight, and durable. - Touch provides precise, robust measurements of in-hand object pose, invariant to lighting and viewpoint. - Cross-modal pretraining on large-scale in-the-wild data significantly improves policy robustness and sample efficiency (as shown many times before — and verified again here!). Also check out our previous investigations of dense, flexible tactile grids for understanding human-robot-environment interactions: - Dense tactile glove (Nature ’19): - 3D-ViTac (CoRL ’24):
Yunzhu Li13,188 görüntüleme • 10 ay önce
Daha fazla içerik yok.