Can we learn a 3D world model that predicts object dynamics directly from videos? Introducing Particle-Grid Neural Dynamics: a learning-based simulator for deformable objects that trains from real-world videos. Website: ArXiv: Code: Demo: To appear at #RSS2025

Kaifeng Zhang

2,777 subscribers

45,971 views • 1 year ago • via X (Twitter)

Science & Technology Education

10 Comments

Kaifeng Zhang • 1 year ago

Modeling ropes, cloth, bags, etc. is hard because of their complex physics and partial observability. Classical simulators struggle to construct exact digital twins from real observations. We overcome these challenges by learning neural dynamics directly from videos.

Kaifeng Zhang • 1 year ago

Our particle-based neural dynamics model represents objects as dense 3D particles and predicts their next-step velocities to simulate object dynamics. It features three stages: particle encoding, grid-velocity editing, and grid-to-particle velocity transfer.
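To make the three stages concrete, here is a minimal NumPy sketch of one particle-grid step. It is an illustration under stated assumptions, not the PGND implementation: the learned particle encoder is replaced by a plain weighted average of particle velocities onto grid nodes, the grid-velocity edit is a hypothetical ground-contact rule, and every function name, grid size, and parameter below is invented for this example. Only the last stage, interpolating grid velocities back to particles, is the standard trilinear technique.

```python
import numpy as np

def trilinear_weights(pos, grid_min, cell, res):
    """Indices and weights of the 8 grid nodes surrounding each particle."""
    # Continuous grid coordinates, clipped so all 8 neighbours stay in bounds.
    g = np.clip((pos - grid_min) / cell, 0.0, res - 1 - 1e-6)
    base = np.floor(g).astype(int)                      # (N, 3) lower corner
    frac = g - base                                     # (N, 3) offset in cell
    corners = np.array([[i, j, k] for i in (0, 1)
                        for j in (0, 1) for k in (0, 1)])   # (8, 3)
    idx = base[:, None, :] + corners[None, :, :]        # (N, 8, 3)
    w = np.where(corners[None] == 1, frac[:, None, :], 1.0 - frac[:, None, :])
    return idx, w.prod(axis=-1)                         # (N, 8, 3), (N, 8)

def particles_to_grid(vel, idx, w, res):
    """Stage 1 stand-in (ASSUMPTION): weighted average instead of a learned encoder."""
    grid_v = np.zeros((res, res, res, 3))
    grid_w = np.zeros((res, res, res))
    flat = (idx[..., 0], idx[..., 1], idx[..., 2])
    np.add.at(grid_w, flat, w)
    np.add.at(grid_v, flat, w[..., None] * vel[:, None, :])
    occupied = grid_w > 0
    grid_v[occupied] /= grid_w[occupied][:, None]
    return grid_v

def edit_grid_velocity(grid_v, grid_min, cell, ground_z=0.0):
    """Stage 2 stand-in (HYPOTHETICAL rule): nodes at the ground cannot move down."""
    res = grid_v.shape[0]
    node_z = grid_min[2] + cell * np.arange(res)
    at_ground = (node_z <= ground_z)[None, None, :]
    edited = grid_v.copy()
    edited[..., 2] = np.where(at_ground, np.maximum(edited[..., 2], 0.0), edited[..., 2])
    return edited

def grid_to_particles(grid_v, idx, w):
    """Stage 3: trilinear interpolation of grid velocities back to particles."""
    node_v = grid_v[idx[..., 0], idx[..., 1], idx[..., 2]]   # (N, 8, 3)
    return (w[..., None] * node_v).sum(axis=1)

def step(pos, vel, dt=0.05, grid_min=(-1.0, -1.0, 0.0), cell=0.25, res=9):
    """One explicit-Euler step through the three stages."""
    grid_min = np.asarray(grid_min, dtype=float)
    idx, w = trilinear_weights(pos, grid_min, cell, res)
    grid_v = particles_to_grid(vel, idx, w, res)
    grid_v = edit_grid_velocity(grid_v, grid_min, cell)
    new_vel = grid_to_particles(grid_v, idx, w)
    return pos + dt * new_vel, new_vel
```

Calling `step(pos, vel)` with `(N, 3)` arrays returns the advanced positions and the transferred velocities. A learned model would replace `particles_to_grid` with a network that predicts the grid velocity field.
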

Kaifeng Zhang • 1 year ago

Trained on videos of robot–object interactions with self-supervision, PGND can model diverse deformable objects, including ropes, cloth, stuffed animals, and paper bags, using <20 minutes of data per object.

Kaifeng Zhang • 1 year ago

PGND becomes a 3D action-conditioned video generator when 3D Gaussian Splatting is plugged in. It aligns better with ground truth, producing visually more realistic deformations than the baseline.

Kaifeng Zhang • 1 year ago

PGND can also act as a photorealistic deformable-object simulator with a complete scan of the scene. Given only a static reconstruction, we simulate the segmented object’s motion with a sequence of robot actions (red arrows).

Kaifeng Zhang • 1 year ago

Finally, PGND serves as a 3D world model within Model Predictive Control. It guides dual-arm cloth lifting, rope shaping, box closing, and plush-toy relocation, achieving fast convergence to target configurations.
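For readers unfamiliar with how a dynamics model plugs into Model Predictive Control, here is a small sketch. The post does not say which optimizer is used, so this uses random shooting as an assumed, simple choice. `toy_dynamics` is a hypothetical stand-in for a learned world model (it just translates all particles by the action), and the cost, horizon, and sample counts are illustrative values, not ones reported for PGND.

```python
import numpy as np

def toy_dynamics(particles, action):
    """HYPOTHETICAL world model: every particle moves by the 3D action."""
    return particles + action

def rollout_cost(particles, actions, target, dynamics):
    """Roll the model forward and score the final state against the target."""
    for a in actions:
        particles = dynamics(particles, a)
    return float(np.mean(np.sum((particles - target) ** 2, axis=-1)))

def plan_action(particles, target, dynamics, rng,
                horizon=5, n_samples=256, max_step=0.05):
    """Random shooting: sample action sequences, keep the cheapest one."""
    candidates = rng.uniform(-max_step, max_step, size=(n_samples, horizon, 3))
    costs = [rollout_cost(particles, seq, target, dynamics) for seq in candidates]
    best = candidates[int(np.argmin(costs))]
    return best[0]  # execute only the first action, then replan

def run_mpc(particles, target, dynamics, n_steps=40, seed=0):
    """Receding-horizon loop: plan, act once, observe, repeat."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        action = plan_action(particles, target, dynamics, rng)
        particles = dynamics(particles, action)
    return particles
```

Swapping `toy_dynamics` for a learned model with the same `(particles, action) -> particles` signature leaves the planning loop unchanged, which is the sense in which the model acts as a world model inside the controller.
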

Kaifeng Zhang • 1 year ago

This work is a close collaboration between Columbia University @ColumbiaCompSci and University of Illinois Urbana-Champaign @siebelschool. Huge thanks to my co-authors: @YunzhuLiYZ, Kris Hauser, @BaoyuLi6 !

Carlos DP • 1 year ago

I love this, and that you made a hf space demo

Hongyu Li • 1 year ago

This is an exciting work. Congrats!!

Kaifeng Zhang • 1 year ago

Thank you Hongyu!

Related Videos

What if we can simulate an *interactive 3D world*, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts env dynamics given RGB-D capture and robot actions. 🌐 from Stanford University NVIDIA

Wenlong Huang

273,255 views • 5 months ago

🧠Model-Based RL shows promise but has seen limited success in real-world robotics. 🌎Introducing Robotic World Model, a black-box end-to-end neural dynamics model that bridges this gap, where policies are trained purely in imagination. NeurIPS Conference 🎯

Chenhao Li

101,700 views • 7 months ago

🌎World models can predict, but controlling real robots from imagination sees a long-standing failure due to hallucination. 🧠Introducing Uncertainty-Aware RWM: a black-box, end-to-end neural dynamics model with long-horizon uncertainty propagation. 🎯

Chenhao Li

54,751 views • 5 months ago

Introducing Starchild-1 from Odyssey, the first ever real-time multimodal world model. This is a model that can generate interactive simulations of the world that you can, for the first time ever, hear. Starchild-1 represents a big step towards a general-purpose world simulator.

Oliver Cameron

244,915 views • 1 month ago

📢 Announcing one of the most exciting works from us this year on **scalable robot policy evaluation through real-to-sim transfer**, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is **orders of magnitude slower** than the development of language agents. We propose a new framework where simulation performance **strongly correlates** with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing **state-of-the-art correlation** between simulation and reality for deformable object manipulation. It provides a **scalable and reproducible evaluation engine** for robot learning. 🌐

Yunzhu Li

39,850 views • 7 months ago

Ego-VCP: a learned ego-vision world model trained offline on demonstration-free random data that predicts dynamics in latent space for humanoid robots. Achieved robust real-time contact-rich planning on a real Unitree G1: bracing against walls, blocking flying objects, traversing low arches. Paper:

The Humanoid Hub

24,742 views • 7 months ago

Introducing General World Models. We believe the next major advancement in AI will come from systems that understand the visual world and its dynamics, which is why we’re starting a new long-term research effort around general world models. Learn more:

Runway

714,233 views • 2 years ago

SAM 2 from Meta FAIR is the first unified model for real-time, promptable object segmentation in images & videos. Using the model in our web-based demo you can segment, track and apply effects to objects in video in just a few clicks. Try SAM 2 ➡️

AI at Meta

88,918 views • 1 year ago

Reinforcement learning is used to speed the production of behavior for the Boston Dynamics Atlas humanoid robot. At the heart of the learning process is a physics-based simulator that generates training data for a variety of maneuvers.

RAI Institute

76,576 views • 1 year ago

What if robots could learn real-world tasks from your perspective… without ever touching a robot? This is a system that trains robot policies using nothing but human-first, egocentric video data from smart glasses. No robots, no teleop, no sensors, just humans doing real tasks in the real world. Why it matters ✅ Learns robot policies from 20 minutes of human video; zero robot demos ✅ Generalizes to new objects, views, and even robot morphologies ✅ Uses 3D points for interpretable, spatially grounded learning ✅ Deploys directly to real-world robots with strong zero-shot success Thank you, Vincent Liu, for sharing!!! Learn more here: 🔗 Paper: 🌐 Website: 📍 BOOKMARK FOR LATER

Ilir Aliu - eu/acc

10,509 views • 1 year ago

DexNDM — a neural dynamics model for dexterous hands Galbot, Tsinghua University and Shanghai Qizhi Institute released DexNDM, a neuromuscular-style dynamics model for dexterous hands. The team reports it reduces reliance on large sets of flawless demonstrations: using biased real-world data, DexNDM bridges Sim→Real for in-hand rotations across arbitrary poses and axes, handling slender, tiny, and irregular objects. Developers say the method is moving from lab validation toward production-ready workflows.

RoboHub🤖

34,445 views • 7 months ago

We are excited to share our latest work, "Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation", where a policy learns to adapt in the real world to unknown disturbances within 5 seconds, both with and without explicit state estimation, directly from visual features. Code released! PDF: Project Page: Starting from a simple analytical dynamics model, the system continuously learns residual dynamics from real-world data and embeds the refined model into a differentiable simulator. This enables fast, gradient-based policy updates that are far more sample-efficient than classical #ReinforcementLearning. We demonstrate rapid adaptation in <5 seconds in agile quadrotor control under challenging conditions, including added payloads, wind disturbances, and large sim-to-real gaps. In real-world experiments, our method reduces hovering error by up to 81% compared to L1-MPC and 55% compared to PPO-based adaptive methods. It also operates directly from visual features without explicit state estimation. Reference: “Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation” IEEE Robotics and Automation Letters, 2026 PDF: Video: Code: Website: Kudos to Michael Pan, Jiaxu Xing, Rudolf Reiter, Yifan Zhai, Elie Aljalbout! UZH Space Hub UZH IfI European Research Council (ERC) AUTOASSESS UZH Science University of Zurich

Davide Scaramuzza

19,103 views • 5 months ago

Hopfield recall Spiking Neural network now this one is cool isn't it? -open source, link in comment- This is a Spiking Neural Network (SNN) that can learn and recall patterns through Hebbian learning, similar to a Hopfield network but using biologically-inspired spiking dynamics rather than energy-based settling.

echo.hive

31,452 views • 5 months ago

[SIGGRAPH '25] EVA: Expressive Virtual Avatars from Multi-view Videos Contributions: 1. We introduce EVA, a novel method enabling full-body control with real-time, photo-realistic renderings, robustly handling loose clothing dynamics and various facial expressions. 2. We develop an expressive deformable template that generates a deformable human template mesh and employs a multi-stage tracking algorithm to faithfully capture facial expressions, body motions, and non-rigid deformations from multi-view videos. 3. We propose a disentangled 3D Gaussian appearance module that models the body and face independently, ensuring separated control and high-quality renderings.

MrNeRF

18,407 views • 1 year ago

This week, Grounding DINO 1.5 was released. It is a new model that uses text prompts to detect objects in videos and images in real time. Examples & demo to try below:

Allen T.

56,015 views • 2 years ago

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? Long-horizon execution, deformable object dexterity, and unseen object generalization: meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulates deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,260 views • 11 months ago

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from multi-view tokens (no motion vectors, no HexPlane); 🔹 Uses a clean, minimal Transformer backbone; 🔹 Generalizes with fast, high-quality feedforward rendering at any view and infinite frame rate. Check out more interactive demos and scaling behaviors on our homepage/paper. 👉Website: 👉Paper:

Martin Ziqiao Ma

21,811 views • 1 year ago

(1/2) Excited to share "Learning Neural Parametric Head Models" #CVPR2023! We capture over 5200 high-quality 3D human head scans from which we build a neural parametric head model that disentangles & expressions and deformations.

Matthias Niessner

53,279 views • 3 years ago

What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. Stanford University #ICRA2026 1/N

Wenlong Huang

105,597 views • 3 months ago