Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:

Jiafei Duan

4,945 subscribers

64,377 views • 2 years ago • via X (Twitter)

Science & Technology Education

10 Comments

Jiafei Duan • 2 years ago

1/8🧵: Language is not precise enough to guide robot behavior. Even the most powerful VLMs like @chatgpt4o have limited accuracy in real robot execution, especially when the task uses spatial relations to identify objects or refer to object-free locations, e.g. “place the cup next to the plate”.

Jiafei Duan • 2 years ago

2/8🧵: Studies in developmental psychology [Tomasello et al. (2007)] show that infants and adults alike share information about their environment via pointing, leading to intuitive plans and precise actions. RoboPoint mimics human pointing by generating actions as points in the RGB image.
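
To make "actions as points in the RGB image" concrete, here is a minimal sketch of how a point-style answer could be turned into pixel targets. It assumes the model replies with a text list of (x, y) tuples normalized to [0, 1]; that format and the helper name `parse_points` are assumptions for illustration and are not stated in this thread.

```python
import ast


def parse_points(answer: str, width: int, height: int) -> list[tuple[int, int]]:
    """Convert an answer such as "[(0.25, 0.5)]" into integer pixel coordinates.

    Assumption: coordinates are normalized to [0, 1], with x along the image
    width and y along the image height. This is a hypothetical output format.
    """
    # literal_eval only accepts Python literals, so arbitrary code is not executed.
    points = ast.literal_eval(answer)
    pixels = []
    for x, y in points:
        # Clamp values that fall slightly outside the image.
        x = min(max(float(x), 0.0), 1.0)
        y = min(max(float(y), 0.0), 1.0)
        # Map [0, 1] onto valid pixel indices [0, width - 1] and [0, height - 1].
        pixels.append((round(x * (width - 1)), round(y * (height - 1))))
    return pixels
```

A downstream controller could then pick one of the returned pixels (for example, the first) and project it into 3D with a depth map; that step is outside this sketch.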

Jiafei Duan • 2 years ago

3/8🧵: As a result, RoboPoint generates precise action guidance. Its generic action space applies to multiple downstream tasks including manipulation, navigation and AR. RoboPoint outperforms Qwen-VL, GPT-4V, and PIVOT by 30.5% on average success rate across all tasks.

Jiafei Duan • 2 years ago

4/8🧵: RoboPoint is instruction-tuned on a mix of real VQA data and procedurally generated robotic data. Surprisingly, by adding synthetic data with templated language, the resulting model's performance improves on real images with natural language commands.

Jiafei Duan • 2 years ago

5/8🧵: RoboPoint outperforms state-of-the-art VLMs with visual prompting including @chatgpt4o, Qwen-VL, LLaVA-NeXT, SpaceVLM by 21.8% in the accuracy of predicting spatial affordance. It generalizes to relation types unseen during training such as “in the middle”.

Jiafei Duan • 2 years ago

6/8🧵: RoboPoint generates consistent predictions with a moving camera. Its predictions are equivariant for view-dependent relations like “to the right” and invariant for view-independent relations like “in between”.

Jiafei Duan • 2 years ago

7/8🧵: Exciting concurrent works (SpatialRGPT @anjjei, SpatialVLM @BoyuanChen0, PIVOT @xf1280, MOKA @fangchenliu_) highlight the trend of enhancing spatial reasoning in robotic VLMs. RoboPoint adds to this by exploring precise affordance prediction through instruction-tuning.

Jiafei Duan • 2 years ago

8/8🧵: This great work was led by @TonyWentaoYuan, assisted by me and many other collaborators from @nvidia @allen_ai @uwcse

Jiafei Duan • 2 years ago

@nvidia @allen_ai @uwcse Lastly, both of us will be at @CVPR #CVPR2024! So come and talk to us 👋

RoboDepot🤖 • 2 years ago

An important ability

Related Videos

Can we collect robot data without any robots? Introducing Universal Manipulation Interface (UMI) An open-source $400 system from Stanford University designed to democratize robot data collection 0 teleop -> autonomously wash dishes (precise), toss (dynamic), and fold clothes (bimanual)

Cheng Chi

438,741 views • 2 years ago

We can teach LLMs to write better robot code through natural language feedback. But can LLMs remember what they were taught and improve their teachability over time? Introducing our latest work, Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang

86,652 views • 2 years ago

Very happy to share our new work APRL (+ open-source code release)! The important step forward we took here is enabling the robot to keep improving with more data—walking faster and adapting to new situations—where prior work saturates.

Laura Smith

48,024 views • 2 years ago

Allow us to introduce you to Hugging Face's new open-source robot HopeJR. The full-size humanoid robot can walk, pick things up, and could be shipped before the end of the year 🤖

TechCrunch

42,562 views • 1 year ago

PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions? Introducing RT-2: our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions!

Karol Hausman

182,789 views • 2 years ago

How to rotate a tomato with a potato using a robot hand? 🤖🍅🥔 Our new model, Robot Synesthesia, blends touch and vision to manipulate multiple objects, even non-convex ones like a cross!

Yuzhe Qin

78,609 views • 2 years ago

Introducing LeLamp. An emotional and expressive robot to reshape our attachment to technology. Pre-orders open now (link below).

Shahvir Sarkary

181,320 views • 7 months ago

How to harness foundation models for *generalization in the wild* in robot manipulation? Introducing VoxPoser: use LLM+VLM to label affordances and constraints directly in 3D perceptual space for zero-shot robot manipulation in the real world! 🌐 🧵👇

Wenlong Huang

293,876 views • 3 years ago

See our robot open doors, water plants, assemble a box, and more by learning from watching humans. Using <1hr of robot data.

Skild AI

36,080 views • 5 months ago

Introduce Open-𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧🤖: ⁣ We need an intuitive and remote teleoperation interface to collect more robot data. 𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧 lets you immersively operate a robot even if you are 3000 miles away, like in the movie 𝘈𝘷𝘢𝘵𝘢𝘳. Open-sourced!

Xuxin Cheng

329,417 views • 2 years ago

Can you teach AI to read sign language? Actually, yes! Our AI solution aims to help people who use Brazilian Sign Language to communicate in real-time with anyone, anywhere, in any language. Learn more: | #LenovoTechWorld

Lenovo

48,411,033 views • 2 years ago

🇨🇦 ROBOT HANDS JUST LEVELED UP IN CANADA Sanctuary AI showed off a robotic hand that uses hydraulic power to grip, twist, and move like a real one. The hand can handle delicate objects, like dice, without crushing them, proving robots are getting scary precise. Hydraulic actuation gives the robot smoother, stronger control, making it better for heavy industrial work and tricky small tasks. This could change how robots work in factories, warehouses, or even in jobs too dangerous for humans. The future of robot hands? Less clunky claws, more human-like grip power with machine muscle. Source: Wevolver

Mario Nawfal

50,661 views • 10 months ago

Excited to release FAST, our new robot action tokenizer! 🤖 Some highlights: - Simple autoregressive VLAs match diffusion VLA performance - Trains up to 5x faster - Works on all robot datasets we tested - First VLAs that work out-of-the-box in new environments! 🧵/

Karl Pertsch

90,625 views • 1 year ago

Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!

Physical Intelligence

457,147 views • 2 months ago

🤖 What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use a pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,471 views • 8 months ago

Excited to release OK-Robot, an open-vocabulary mobile-manipulator for homes. Simply tell the robot what to pick and where to drop it in natural language, and it will do it. Like: Me: "OK Robot, move the Takis from the desk to the nightstand" Robot: ⬇️

Lerrel Pinto

152,251 views • 2 years ago

ELON: OPTIMUS ROBOT WILL PREPARE THE WAY FOR HUMANS “The first flights there we will send with the Optimus robot, so it can go out there to explore and kind of prepare the way for humans. By launching end of next year. We’ll actually technically arrive in 2027.” Source: Elon Musk The Humanoid Hub

Mario Nawfal

234,990 views • 1 year ago

💜 𝗟𝗜𝗞𝗘 + 𝗥𝗧 + 𝗖𝗢𝗠𝗠𝗘𝗡𝗧 "𝗚 𝗙𝗨𝗘𝗟" TO WIN A #GFUEL ITEM OF YOUR CHOICE! 🤖 2 WINNERS PICKED MONDAY BECAUSE WE LOVE HUMANS AND I'M TOTALLY NOT A ROBOT!

G FUEL®

114,585 views • 3 years ago

🤖 How can robot policies zero-shot generalize to any new environment and any new object? Introducing our new project: 🚀Data Scaling Laws in Imitation Learning for Robotic Manipulation🚀—bringing us closer to the dream of having robots work as waiters in hot pot restaurants! 🍲

Yang Gao

124,995 views • 1 year ago