
Jiafei Duan
@DJiafei • 6,025 subscribers
Incoming Assistant Professor at @NUScomputing | Robotics & AI PhD student @uwcse | Host of @RoboPapers | Ex-@allen_ai, @NVIDIA.

Most capable generalist robotics models today are closed or, at best, open-weights. But robotics won’t reach its ChatGPT moment without real openness. That ChatGPT moment was built on years of open tools and datasets, such as Python, PyTorch, and ImageNet, that let researchers inspect, reproduce, and build. Today, we’re introducing MolmoAct 2: a fully open-source action reasoning model for real-world robotics. We rethought and reshaped everything! 🧵👇
Jiafei Duan • 100,565 views • 1 month ago

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way to do reward modeling for robotics! Project: 🧵👇
Jiafei Duan • 107,760 views • 3 months ago

Why do generalist robotic models fail when a cup is moved just two inches to the left? It’s not a lack of motor skill; it’s an alignment problem. Today, we introduce VLS: Vision-Language Steering of Pretrained Robot Policies, a training-free framework that guides robot behavior in real time. Check out the project: 👇🧵 (Watch till the end: VLS runs uncut, steering pretrained policies across long-horizon tasks.)
Jiafei Duan • 68,034 views • 3 months ago

Reasoning is central to purposeful action. Today we introduce MolmoAct, a fully open Action Reasoning Model (ARM) for robotics. It is grounded in large-scale pre-training on action reasoning data, and every predicted action is interpretable and user-steerable via a visual trace. We are open-sourcing everything!
Jiafei Duan • 99,921 views • 9 months ago

Can we build a generalist robotic policy that doesn’t just memorize training data and regurgitate it at test time, but instead remembers its past actions and conditions its decisions on them? 🤖💡 Introducing SAM2Act, a multi-view, transformer-based robotic policy that integrates a visual foundation model with a memory architecture for robotic manipulation. Project page: 🧵👇
Jiafei Duan • 87,573 views • 1 year ago

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.
Jiafei Duan • 12,072 views • 1 month ago

Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance for robot behavior. Can we teach a robot to point like humans do? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:
Jiafei Duan • 64,377 views • 1 year ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!
Jiafei Duan • 48,739 views • 1 year ago

What if robots could think longer on harder problems without saying a single word?🤔 We introduce RD-VLA (Recurrent-Depth VLA): a latent, iterative reasoning architecture for robot control. ❌No Chain-of-Thought tokens. ❌No extra memory overhead. ✅Just reasoning—directly in latent space. 🧠🤖 Project page: 👇🧵
Jiafei Duan • 13,741 views • 3 months ago