Can we learn a 3D world model that predicts object dynamics directly from videos? Introducing Particle-Grid Neural Dynamics: a learning-based simulator for deformable objects that trains from real-world videos. Website: ArXiv: Code: Demo: To appear at #RSS2025

Kaifeng Zhang

2,777 subscribers

45,971 views • 1 year ago • via X (Twitter)

Science, Technology & Education

10 comments

Kaifeng Zhang • 1 year ago

Modeling ropes, cloth, bags, etc. is hard because of their complex physics and partial observability. Classical simulators struggle to construct exact digital twins from real observations. We overcome these challenges by learning neural dynamics directly from videos.

Kaifeng Zhang • 1 year ago

Our particle-based neural dynamics model represents objects as dense 3D particles and predicts their next-step velocities to simulate object dynamics. It features three stages: particle encoding, grid-velocity editing, and grid-to-particle velocity transfer.
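The three stages named above can be illustrated with a minimal NumPy sketch. This is not the PGND implementation: the learned particle encoder and grid network are replaced by trilinear scatter and gather, and every function name, the grid resolution, and the ground-plane rule in `edit_grid_velocity` are assumptions made only for illustration.

```python
import numpy as np

# The 8 corner offsets of a grid cell, shape (8, 3).
OFFSETS = np.array([[i, j, k] for i in (0, 1) for j in (0, 1) for k in (0, 1)])


def trilinear_weights(pos, grid_res, cell):
    """Return flat indices and weights of the 8 grid nodes around each particle."""
    g = pos / cell                                            # continuous grid coords
    base = np.clip(np.floor(g).astype(int), 0, grid_res - 2)  # lower corner of the cell
    frac = g - base                                           # position inside the cell
    idx = base[:, None, :] + OFFSETS[None, :, :]              # (N, 8, 3)
    w = np.where(OFFSETS[None] == 1, frac[:, None, :], 1.0 - frac[:, None, :])
    flat = np.ravel_multi_index(
        (idx[..., 0], idx[..., 1], idx[..., 2]), (grid_res,) * 3
    )
    return flat, w.prod(axis=-1)                              # (N, 8), (N, 8)


def particles_to_grid(vel, flat, w, grid_res):
    """Stage 1 stand-in: scatter particle velocities to grid nodes (weighted mean)."""
    grid_w = np.zeros(grid_res ** 3)
    grid_v = np.zeros((grid_res ** 3, 3))
    np.add.at(grid_w, flat, w)
    np.add.at(grid_v, flat, w[..., None] * vel[:, None, :])
    return grid_v / np.maximum(grid_w, 1e-8)[:, None]


def edit_grid_velocity(grid_v, grid_res):
    """Stage 2 stand-in (hypothetical rule): block downward motion on the z = 0 layer."""
    g = grid_v.reshape(grid_res, grid_res, grid_res, 3).copy()
    g[:, :, 0, 2] = np.maximum(g[:, :, 0, 2], 0.0)
    return g.reshape(-1, 3)


def grid_to_particles(grid_v, flat, w):
    """Stage 3: gather grid velocities back to the particles."""
    return (w[..., None] * grid_v[flat]).sum(axis=1)


def step(pos, vel, dt=0.05, grid_res=16, cell=1.0 / 15):
    """Advance particles one step; positions are assumed to lie in [0, 1]^3."""
    flat, w = trilinear_weights(pos, grid_res, cell)
    grid_v = particles_to_grid(vel, flat, w, grid_res)
    grid_v = edit_grid_velocity(grid_v, grid_res)
    new_vel = grid_to_particles(grid_v, flat, w)
    return pos + dt * new_vel, new_vel
```

In a learned model, `vel` would come from a network conditioned on past particle motion and the robot action; here it is passed in directly so the transfer steps can be checked in isolation.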

Kaifeng Zhang • 1 year ago

Trained under self-supervision on videos of robot–object interactions, PGND can model diverse deformable objects, including ropes, cloth, stuffed animals, and paper bags, using <20 minutes of data per object.

Kaifeng Zhang • 1 year ago

PGND becomes a 3D action-conditioned video generator when 3D Gaussian Splatting is plugged in. It aligns better with ground truth, producing visually more realistic deformations than the baseline.

Kaifeng Zhang • 1 year ago

PGND can also act as a photorealistic deformable-object simulator with a complete scan of the scene. Given only a static reconstruction, we simulate the segmented object’s motion with a sequence of robot actions (red arrows).

Kaifeng Zhang • 1 year ago

Finally, PGND serves as a 3D world model within Model Predictive Control. It guides dual-arm cloth lifting, rope shaping, box closing, and plush-toy relocation, achieving fast convergence to target configurations.
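How a dynamics model plugs into Model Predictive Control can be sketched as follows. The post does not say which planner is used, so this example swaps in plain random shooting, and `toy_dynamics` is a hypothetical stand-in for the learned model (it simply translates every particle by the action). All names and parameter values are assumptions.

```python
import numpy as np


def toy_dynamics(particles, action):
    """Hypothetical stand-in for a learned model: shift all particles by the action."""
    return particles + action


def rollout_cost(particles, actions, target, dynamics):
    """Roll the model forward over an action sequence and score the final state."""
    for a in actions:
        particles = dynamics(particles, a)
    return float(np.mean(np.sum((particles - target) ** 2, axis=-1)))


def plan_random_shooting(particles, target, dynamics, rng,
                         horizon=5, n_samples=256, max_step=0.05):
    """Sample random action sequences and return the one with the lowest cost."""
    candidates = rng.uniform(-max_step, max_step, size=(n_samples, horizon, 3))
    costs = [rollout_cost(particles, seq, target, dynamics) for seq in candidates]
    return candidates[int(np.argmin(costs))]


def run_mpc(particles, target, dynamics, n_steps=20, seed=0):
    """Receding-horizon loop: plan, execute only the first action, then replan."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        plan = plan_random_shooting(particles, target, dynamics, rng)
        particles = dynamics(particles, plan[0])
    return particles


def state_cost(particles, target):
    """Mean squared distance between current and target particle positions."""
    return float(np.mean(np.sum((particles - target) ** 2, axis=-1)))
```

Calling `run_mpc(start, target, toy_dynamics)` with two particle arrays of the same shape returns the particle state after 20 replanning steps. In a real system the execution step would send `plan[0]` to the robot and re-observe the object rather than call the model again.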

Kaifeng Zhang • 1 year ago

This work is a close collaboration between Columbia University @ColumbiaCompSci and University of Illinois Urbana-Champaign @siebelschool. Huge thanks to my co-authors: @YunzhuLiYZ, Kris Hauser, @BaoyuLi6 !

Carlos DP • 1 year ago

I love this, and that you made a hf space demo

Hongyu Li • 1 year ago

This is an exciting work. Congrats!!

Kaifeng Zhang • 1 year ago

Thank you Hongyu!

Related videos

What if we can simulate an interactive 3D world, from a single image, in the wild, in real time? Introducing PointWorld-1B: a large pre-trained 3D world model that predicts env dynamics given RGB-D capture and robot actions. 🌐 from Stanford University NVIDIA

Wenlong Huang

274,633 views • 6 months ago

🧠Model-Based RL shows promise but has seen limited success in real-world robotics. 🌎Introducing Robotic World Model, a black-box end-to-end neural dynamics model that bridges this gap, where policies are trained purely in imagination. NeurIPS Conference 🎯

Chenhao Li

102,560 views • 8 months ago

🌎World models can predict, but controlling real robots from imagination sees a long-standing failure due to hallucination. 🧠Introducing Uncertainty-Aware RWM: a black-box, end-to-end neural dynamics model with long-horizon uncertainty propagation. 🎯

Chenhao Li

54,751 views • 5 months ago

Introducing Starchild-1 from Odyssey, the first ever real-time multimodal world model. This is a model that can generate interactive simulations of the world that you can, for the first time ever, hear. Starchild-1 represents a big step towards a general-purpose world simulator.

Oliver Cameron

245,931 views • 2 months ago

📢 Announcing one of the most exciting works from us this year on scalable robot policy evaluation through real-to-sim transfer, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is orders of magnitude slower than the development of language agents. We propose a new framework where simulation performance strongly correlates with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing state-of-the-art correlation between simulation and reality for deformable object manipulation. It provides a scalable and reproducible evaluation engine for robot learning. 🌐

Yunzhu Li

39,900 views • 8 months ago

Ego-VCP: a learned ego-vision world model trained offline on demonstration-free random data that predicts dynamics in latent space for humanoid robots. Achieved robust real-time contact-rich planning on a real Unitree G1: bracing against walls, blocking flying objects, traversing low arches. Paper:

The Humanoid Hub

24,742 views • 8 months ago

Introducing General World Models. We believe the next major advancement in AI will come from systems that understand the visual world and its dynamics, which is why we’re starting a new long-term research effort around general world models. Learn more:

Runway

714,292 views • 2 years ago

Reinforcement learning is used to speed the production of behavior for the Boston Dynamics Atlas humanoid robot. At the heart of the learning process is a physics-based simulator that generates training data for a variety of maneuvers.

RAI Institute

76,595 views • 1 year ago

SAM 2 from Meta FAIR is the first unified model for real-time, promptable object segmentation in images & videos. Using the model in our web-based demo you can segment, track and apply effects to objects in video in just a few clicks. Try SAM 2 ➡️

AI at Meta

88,918 views • 1 year ago

What if robots could learn real-world tasks from your perspective… without ever touching a robot? This is a system that trains robot policies using nothing but human-first, egocentric video data from smart glasses. No robots, no teleop, no sensors, just humans doing real tasks in the real world. Why it matters ✅ Learns robot policies from 20 minutes of human video; zero robot demos ✅ Generalizes to new objects, views, and even robot morphologies ✅ Uses 3D points for interpretable, spatially grounded learning ✅ Deploys directly to real-world robots with strong zero-shot success Thank you, Vincent Liu, for sharing!!! Learn more here: 🔗 Paper: 🌐 Website: 📍 BOOKMARK FOR LATER

Ilir Aliu - eu/acc

10,509 views • 1 year ago

DexNDM — a neural dynamics model for dexterous hands Galbot, Tsinghua University and Shanghai Qizhi Institute released DexNDM, a neuromuscular-style dynamics model for dexterous hands. The team reports it reduces reliance on large sets of flawless demonstrations: using biased real-world data, DexNDM bridges Sim→Real for in-hand rotations across arbitrary poses and axes, handling slender, tiny, and irregular objects. Developers say the method is moving from lab validation toward production-ready workflows.

RoboHub🤖

34,445 views • 8 months ago

We are excited to share our latest work, "Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation", where a policy learns to adapt in the real world to unknown disturbances within 5 seconds, both with and without explicit state estimation, directly from visual features. Code released! PDF: Project Page: Starting from a simple analytical dynamics model, the system continuously learns residual dynamics from real-world data and embeds the refined model into a differentiable simulator. This enables fast, gradient-based policy updates that are far more sample-efficient than classical #ReinforcementLearning. We demonstrate rapid adaptation in <5 seconds in agile quadrotor control under challenging conditions, including added payloads, wind disturbances, and large sim-to-real gaps. In real-world experiments, our method reduces hovering error by up to 81% compared to L1-MPC and 55% compared to PPO-based adaptive methods. It also operates directly from visual features without explicit state estimation. Reference: “Learning on the Fly: Rapid Policy Adaptation via Differentiable Simulation” IEEE Robotics and Automation Letters, 2026 PDF: Video: Code: Website: Kudos to Michael Pan, Jiaxu Xing, Rudolf Reiter, Yifan Zhai, Elie Aljalbout! UZH Space Hub UZH IfI European Research Council (ERC) AUTOASSESS UZH Science University of Zurich

Davide Scaramuzza

19,184 views • 6 months ago

Hopfield recall Spiking Neural network now this one is cool isn't it? -open source, link in comment- This is a Spiking Neural Network (SNN) that can learn and recall patterns through Hebbian learning, similar to a Hopfield network but using biologically-inspired spiking dynamics rather than energy-based settling.

echo.hive

31,452 views • 6 months ago

[SIGGRAPH '25] EVA: Expressive Virtual Avatars from Multi-view Videos Contributions: 1. We introduce EVA, a novel method enabling full-body control with real-time, photo-realistic renderings, robustly handling loose clothing dynamics and various facial expressions. 2. We develop an expressive deformable template that generates a deformable human template mesh and employs a multi-stage tracking algorithm to faithfully capture facial expressions, body motions, and non-rigid deformations from multi-view videos. 3. We propose a disentangled 3D Gaussian appearance module that models the body and face independently, ensuring separated control and high-quality renderings.

MrNeRF

18,407 views • 1 year ago

What happens when robot world models learn from human experience at scale? 🤔 DreamDojo from NVIDIA Research is a generalist robot world model pretrained on 44K hours of egocentric human videos and then post-trained on robot data to generalize across new objects and environments. After distillation, it runs at 10 FPS for live teleoperation, policy evaluation, and model-based planning. Read the ICML paper to learn more 📄

NVIDIA Robotics

22,322 views • 18 days ago

This week, Grounding DINO 1.5 was released. It is a new model that uses text prompts to detect objects in videos and images in real time. Examples & demo to try below:

Allen T.

56,018 views • 2 years ago

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? long-horizon execution, deformable object dexterity, and unseen object generalization — meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable Vision-Language-Action (VLA) model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data. ✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts. ✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities. ✨ Deformable Object Manipulation: Manipulate deformable objects robustly. Project Page: Arxiv: #ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,323 views • 1 year ago

Can we scale 4D pretraining to learn general space-time representations that reconstruct an object from a few views at any time to any view at any other time? Introducing 4D-LRM: a Large Space-Time Reconstruction Model that ... 🔹 Predicts 4D Gaussian primitives directly from multi-view tokens (no motion vectors, no HexPlane); 🔹 Uses a clean, minimal Transformer backbone; 🔹 Generalizes with fast, high-quality feedforward rendering at any view and infinite frame rate. Check out more interactive demos and scaling behaviors on our homepage/paper. 👉Website: 👉Paper:

Martin Ziqiao Ma

21,811 views • 1 year ago

(1/2) Excited to share "Learning Neural Parametric Head Models" #CVPR2023! We capture over 5200 high-quality 3D human head scans from which we build a neural parametric head model that disentangles & expressions and deformations.

Matthias Niessner

53,312 views • 3 years ago

🔥Diffusion Masters Transparent Objects in Videos! We introduce #DKT, a foundation model that repurposes video diffusion for zero-shot depth and normal estimation on Transparent and Reflective Objects with Superior Temporal Consistency Demo: Code:

Chongjie Ye

27,950 views • 7 months ago