Jiafei Duan

@DJiafei • 6,378 subscribers

Assistant Professor at @NUScomputing | Robotics & AI PhD @uwcse | Host of @RoboPapers | Ex-@allen_ai, @NVIDIA. My opinions are my own.

Shorts

My prediction: we’re about to see hundreds of robotics papers built on SAM3D. Waiting for the first one to drop 👀 SAM3 & SAM3D from AI at Meta are just too good to be true!

31,825 views

Can your VLA perform collision avoidance when explicitly instructed in language? The answer is no. However, with VLS steering it is possible now! The same policy can successfully avoid collisions when given the instruction.

15,036 views

To be honest, evaluation in robotics has always been a challenge, unlike in vision or language: not enough DATA, not enough TASKS, and too much reliance on FIXED benchmarks. PlayGround pushes us beyond static datasets and predefined tasks, moving robotics evaluation into the era of dynamic evaluation, one structured physical domain at a time. Can't wait to see how the community scales this!

10,506 views

Robot paper of the day: "RoboBallet: Planning for multirobot reaching with graph neural networks and reinforcement learning". UCL + Google DeepMind + Intrinsic built an AI planner that choreographs teams of arms to work in tight spaces without collisions, planning in seconds, not days.

15,084 views

Videos

Most capable generalist robotics models today are closed or, at best, open-weights. But robotics won’t reach its ChatGPT moment without real openness. That GPT moment was built on years of open tools and datasets, such as Python, PyTorch, ImageNet, and more, that let researchers inspect, reproduce, and build. Today, we’re introducing MolmoAct 2: a fully open-source action reasoning model for real-world robotics. We rethought and reshaped everything! 🧵👇

108,164 views • 2 months ago

Given the strong community adoption and real-world deployment of MolmoAct2 on YAM, we're introducing zero-shot evaluation of MolmoAct2 Bimanual YAM in simulation. Now you can test our models without a real-world robot and build on them! Code: Simulation built on ManiSkill!

36,290 views • 1 month ago

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce: TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modelling for robotics! Project: 🧵👇

107,926 views • 4 months ago

Why do generalist robotic models fail when a cup is moved just two inches to the left? It’s not a lack of motor skill, it’s an alignment problem. Today, we introduce VLS: Vision-Language Steering of Pretrained Robot Policies, a training-free framework that guides robot behavior in real time. Check out the project: 👇🧵 (Watch till the end: VLS runs uncut, steering pretrained policies across long-horizon tasks.)

72,319 views • 5 months ago

Reasoning is central to purposeful action. Today we introduce MolmoAct — a fully open Action Reasoning Model (ARM) for robotics. Grounded in large-scale pre-training with action reasoning data, every predicted action is interpretable and user-steerable via visual trace. We are open-sourcing everything!

99,944 views • 11 months ago

Finally hooded! Collecting egocentric data even during graduation! Huge thanks to my advisors, Dieter Fox and Ranjay Krishna!

11,320 views • 1 month ago

Can we build a generalist robotic policy that doesn’t just memorize training data and regurgitate it during test time, but instead remembers past actions as memory and conditions its decisions on them?🤖💡 Introducing SAM2Act—a multi-view robotic transformer-based policy that integrates a visual foundation model with a memory architecture for robotic manipulation. Project page: 🧵👇

87,573 views • 1 year ago

Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:

64,377 views • 2 years ago

We pushed MolmoAct 2 into a completely out-of-distribution studio setting to test its robustness under extreme environmental changes. Despite only 10 minutes of task-specific fine-tuning data collected elsewhere, it adapted surprisingly well to the new environment.

11,381 views • 2 months ago

Every time I watch this video, I can't help but wonder: why don't we have robot butlers in our homes yet? The hardware seemed capable 14 years ago with teleoperation. Is it just the "robot brain" we're missing, or is there more to the puzzle?

47,387 views • 1 year ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!

48,773 views • 1 year ago

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.

12,072 views • 3 months ago

What if robots could think longer on harder problems without saying a single word?🤔 We introduce RD-VLA (Recurrent-Depth VLA): a latent, iterative reasoning architecture for robot control. ❌No Chain-of-Thought tokens. ❌No extra memory overhead. ✅Just reasoning—directly in latent space. 🧠🤖 Project page: 👇🧵

13,833 views • 5 months ago

For large-scale robotic deployment 🤖 in the real world 🌏, robots must adapt to changes in environment and objects. Ever questioned the generalizability of your robot's manipulation policy? Put it to the test with The Colosseum 🏛️. Check out our project:

36,617 views • 2 years ago

🚨Is it possible to devise an intuitive approach for crowdsourcing trainable data for robots without requiring a physical robot🤖? Can we democratize robot learning for all?🧑‍🤝‍🧑 Check out our latest #CoRL2023 paper-> AR2-D2: Training a Robot Without a Robot

38,871 views • 2 years ago
