
Jiafei Duan
@DJiafei • 6,025 subscribers
Incoming Assistant Professor at @NUScomputing | Robotics & AI PhD student @uwcse | Host of @RoboPapers | Ex-@allen_ai, @NVIDIA.

Most capable generalist robotics models today are closed or, at best, open-weights. But robotics won’t reach its ChatGPT moment without real openness. That GPT moment was built on years of open tools and datasets, such as Python, PyTorch, and ImageNet, that let researchers inspect, reproduce, and build. Today, we’re introducing MolmoAct 2: a fully open-source action reasoning model for real-world robotics. We rethought and reshaped everything! 🧵👇
Jiafei Duan • 100,565 views • 1 month ago

Instead of asking a VLM to output progress, it reads the model’s internal belief directly from token logits. No in-context learning. No fine-tuning. No reward training. 📈 We introduce: TOPReward, a zero-shot reward modeling approach for robotics using token probabilities from pretrained video VLMs. The simplest way of doing reward modeling for robotics! Project: 🧵👇
Jiafei Duan • 107,760 views • 3 months ago

Why do generalist robotic models fail when a cup is moved just two inches to the left? It’s not a lack of motor skill, it’s an alignment problem. Today, we introduce VLS: Vision-Language Steering of Pretrained Robot Policies, a training-free framework that guides robot behavior in real time. Check out the project: 👇🧵 (Watch till the end: VLS runs uncut, steering pretrained policies across long-horizon tasks.)
Jiafei Duan • 68,034 views • 3 months ago

Reasoning is central to purposeful action. Today we introduce MolmoAct — a fully open Action Reasoning Model (ARM) for robotics. Grounded in large-scale pre-training with action reasoning data, every predicted action is interpretable and user-steerable via visual trace. We are open-sourcing everything!
Jiafei Duan • 99,921 views • 9 months ago

Can we build a generalist robotic policy that doesn’t just memorize training data and regurgitate it during test time, but instead remembers past actions as memory and conditions its decisions on them?🤖💡 Introducing SAM2Act—a multi-view robotic transformer-based policy that integrates a visual foundation model with a memory architecture for robotic manipulation. Project page: 🧵👇
Jiafei Duan • 87,573 views • 1 year ago

Introducing WildDet3D, a grounding model for monocular 3D object detection in the wild. A question I keep coming back to is: what is the right backbone for robotics foundation models? Should it be a video model, a language model, or perhaps a grounding model? WildDet3D is our first step in exploring that direction.
Jiafei Duan • 12,072 views • 1 month ago

Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:
Jiafei Duan • 64,377 views • 1 year ago

Humans learn and improve from failures. Similarly, foundation models adapt based on human feedback. Can we leverage this failure understanding to enhance robotics systems that use foundation models? Introducing AHA—a vision-language model for detecting and reasoning over failures in robotic manipulation. Project page: 🧵Thread👇 Aha!
Jiafei Duan • 48,739 views • 1 year ago

What if robots could think longer on harder problems without saying a single word?🤔 We introduce RD-VLA (Recurrent-Depth VLA): a latent, iterative reasoning architecture for robot control. ❌No Chain-of-Thought tokens. ❌No extra memory overhead. ✅Just reasoning—directly in latent space. 🧠🤖 Project page: 👇🧵
Jiafei Duan • 13,741 views • 3 months ago