Humans use pointing to communicate plans intuitively. Compared to language, pointing gives more precise guidance to robot behaviors. Can we teach a robot how to point like humans? Introducing RoboPoint 🤖👉, an open-source VLM instruction-tuned to point. Check out our new work:

Jiafei Duan

6,395 subscribers

64,511 views • 2 years ago • via X (Twitter)

Science, Technology & Education

10 comments

Jiafei Duan • 2 years ago

1/8🧵: Language is not precise enough to guide robot behavior. Even the most powerful VLMs like @chatgpt4o have limited accuracy in real robot execution, especially when the task uses spatial relations to identify objects or refer to object-free locations, e.g. “place the cup next to the plate”.

Jiafei Duan • 2 years ago

2/8🧵: Studies in developmental psychology [Tomasello et al. (2007)] show that infants and adults alike share information about their environment via pointing, leading to intuitive plans and precise actions. RoboPoint mimics human pointing by generating actions as points in the RGB image.
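To make "actions as points in the RGB image" concrete, here is a minimal sketch of how a downstream consumer might turn a pointing model's text reply into pixel coordinates. It assumes the reply is a text list of normalized (x, y) tuples in [0, 1]; that reply format and the helper names `parse_points` and `to_pixels` are assumptions for illustration, not something the thread confirms.

```python
import re

# Hypothetical reply format: a text list of normalized (x, y) tuples.
POINT_PATTERN = re.compile(r"\(\s*([0-9]*\.?[0-9]+)\s*,\s*([0-9]*\.?[0-9]+)\s*\)")


def parse_points(text):
    """Extract every "(x, y)" pair from a text reply as float tuples."""
    return [(float(x), float(y)) for x, y in POINT_PATTERN.findall(text)]


def to_pixels(points, width, height):
    """Scale normalized [0, 1] points to integer pixel coordinates.

    Coordinates are clamped so that x = 1.0 or y = 1.0 still lands
    inside the image.
    """
    return [
        (min(int(x * width), width - 1), min(int(y * height), height - 1))
        for x, y in points
    ]


reply = "[(0.25, 0.5), (0.75, 0.5)]"  # made-up example reply
pixels = to_pixels(parse_points(reply), width=640, height=480)
print(pixels)
```

With a depth image and camera intrinsics, pixel targets like these could then be lifted to 3D goals for a motion planner; that step is omitted here.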

Jiafei Duan • 2 years ago

3/8🧵: As a result, RoboPoint generates precise action guidance. Its generic action space applies to multiple downstream tasks including manipulation, navigation and AR. RoboPoint outperforms Qwen-VL, GPT-4V, and PIVOT by 30.5% on average success rate across all tasks.

Jiafei Duan • 2 years ago

4/8🧵: RoboPoint is instruction-tuned on a mix of real VQA data and procedurally generated robotic data. Surprisingly, by adding synthetic data with templated language, the resulting model's performance improves on real images with natural language commands.

Jiafei Duan • 2 years ago

5/8🧵: RoboPoint outperforms state-of-the-art VLMs with visual prompting, including @chatgpt4o, Qwen-VL, LLaVA-NeXT, and SpaceVLM, by 21.8% in the accuracy of predicting spatial affordance. It generalizes to relation types unseen during training such as “in the middle”.

Jiafei Duan • 2 years ago

6/8🧵: RoboPoint generates consistent predictions with a moving camera. Its predictions are equivariant for view-dependent relations like “to the right” and invariant for view-independent relations like “in between”.
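The difference between invariant and equivariant targets can be illustrated with a toy top-down geometry. This is only a sketch of the two properties, not the RoboPoint model: objects are 2D world points, the camera is reduced to a single yaw angle, and the functions `in_between`, `right_of`, and `rotate` are hypothetical helpers.

```python
import math


def in_between(a, b):
    """View-independent target: the midpoint of two objects (world frame)."""
    return ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)


def right_of(a, camera_yaw, distance=1.0):
    """View-dependent target: a point `distance` along the camera's right axis."""
    return (a[0] + distance * math.cos(camera_yaw),
            a[1] + distance * math.sin(camera_yaw))


def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


cup, plate = (0.0, 0.0), (2.0, 0.0)
yaw, turn = 0.3, 0.5  # camera yaw before the move, and how far it turns

# Invariant: "in between" never looks at the camera pose, so the world
# target is the same from every viewpoint.
mid = in_between(cup, plate)

# Equivariant: turning the camera by `turn` rotates the "to the right"
# offset (target minus cup) by exactly the same angle.
before = right_of(cup, yaw)
after = right_of(cup, yaw + turn)
offset_before = (before[0] - cup[0], before[1] - cup[1])
expected = rotate(offset_before, turn)
print(mid, after, expected)
```

In this toy setup, a consistent predictor keeps `mid` fixed as the camera moves, while its "to the right" target follows the camera in the way `expected` describes.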

Jiafei Duan • 2 years ago

7/8🧵: Exciting concurrent works (SpatialRGPT @anjjei, SpatialVLM @BoyuanChen0, PIVOT @xf1280, MOKA @fangchenliu_) highlight the trend of enhancing spatial reasoning in robotic VLMs. RoboPoint adds to this by exploring precise affordance prediction through instruction-tuning.

Jiafei Duan • 2 years ago

8/8🧵: This great work was led by @TonyWentaoYuan, assisted by me and many other collaborators from @nvidia, @allen_ai, and @uwcse.

Jiafei Duan • 2 years ago

@nvidia @allen_ai @uwcse Lastly, both of us will be at @CVPR #CVPR2024! So come and talk to us👋

RoboDepot🤖 • 2 years ago

An important ability

Related videos

Can we collect robot data without any robots? Introducing Universal Manipulation Interface (UMI) An open-source $400 system from Stanford University designed to democratize robot data collection 0 teleop -> autonomously wash dishes (precise), toss (dynamic), and fold clothes (bimanual)

Cheng Chi

439,244 views • 2 years ago

🚨Is it possible to devise an intuitive approach for crowdsourcing trainable data for robots without requiring a physical robot🤖? Can we democratize robot learning for all?🧑‍🤝‍🧑 Check out our latest #CoRL2023 paper-> AR2-D2: Training a Robot Without a Robot

Jiafei Duan

38,871 views • 2 years ago

We can teach LLMs to write better robot code through natural language feedback. But can LLMs remember what they were taught and improve their teachability over time? Introducing our latest work, Learning to Learn Faster from Human Feedback with Language Model Predictive Control

Jacky Liang

86,680 views • 2 years ago

Very happy to share our new work APRL (+ open-source code release)! The important step forward we took here is enabling the robot to keep improving with more data—walking faster and adapting to new situations—where prior work saturates.

Laura Smith

48,024 views • 2 years ago

PaLM-E or GPT-4 can speak in many languages and understand images. What if they could speak robot actions? Introducing RT-2: our new model that uses a VLM (up to 55B params) backbone and fine-tunes it to directly output robot actions!

Karol Hausman

182,789 views • 3 years ago

How to rotate a tomato with a potato using a robot hand? 🤖🍅🥔 Our new model, Robot Synesthesia, blends touch and vision to manipulate multiple objects, even non-convex ones like a cross!

Yuzhe Qin

78,609 views • 2 years ago

Introduce Open-𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧🤖: ⁣ We need an intuitive and remote teleoperation interface to collect more robot data. 𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧 lets you immersively operate a robot even if you are 3000 miles away, like in the movie 𝘈𝘷𝘢𝘵𝘢𝘳. Open-sourced!

Xuxin Cheng

329,503 views • 2 years ago

Can VLMs enable robots to autonomously improve? In our new work we ran a fleet of robot arms to collect autonomous data with VLM-proposed tasks and showed that robots can keep getting better as they are deployed, without supervision: 🧵👇

Sergey Levine

19,449 views • 2 years ago

🇨🇦 ROBOT HANDS JUST LEVELED UP IN CANADA Sanctuary AI showed off a robotic hand that uses hydraulic power to grip, twist, and move like a real one. The hand can handle delicate objects, like dice, without crushing them, proving robots are getting scary precise. Hydraulic actuation gives the robot smoother, stronger control, making it better for heavy industrial work and tricky small tasks. This could change how robots work in factories, warehouses, or even in jobs too dangerous for humans. The future of robot hands? Less clunky claws, more human-like grip power with machine muscle. Source: Wevolver

Mario Nawfal

50,661 views • 11 months ago

Excited to release FAST, our new robot action tokenizer! 🤖 Some highlights: - Simple autoregressive VLAs match diffusion VLA performance - Trains up to 5x faster - Works on all robot datasets we tested - First VLAs that work out-of-the-box in new environments! 🧵/

Karl Pertsch

90,625 views • 1 year ago

Our newest model, π0.7, has some interesting emergent capabilities: it can control a new robot to fold shirts for which we had no shirt folding data, figure out how to use an appliance with language-based coaching, and perform a wide range of dexterous tasks all in one model!

Physical Intelligence

465,465 views • 3 months ago

🤖What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,613 views • 9 months ago

ELON: OPTIMUS ROBOT WILL PREPARE THE WAY FOR HUMANS “The first flights there we will send with the Optimus robot, so it can go out there to explore and kind of prepare the way for humans. By launching end of next year. We’ll actually technically arrive in 2027.” Source: Elon Musk The Humanoid Hub

Mario Nawfal

234,990 views • 1 year ago

🤖 How can robot policies zero-shot generalize to any new environment and any new object? Introducing our new project: 🚀Data Scaling Laws in Imitation Learning for Robotic Manipulation🚀—bringing us closer to the dream of having robots work as waiters in hot pot restaurants! 🍲

Yang Gao

125,057 views • 1 year ago

Can we bring human-like Touch to robots🤖? Introducing our CoRL work on 3D-ViTac. Humans rely on both vision 👁️ and touch 🫳 for complex tasks. With combined visual-tactile sensing, robots can now tackle challenging tasks, like precise in-hand reorientation, fragile objects grasping. Website: #Robotics #CoRL2024 #Touch #tactile #AI #ML

Binghao Huang

49,544 views • 1 year ago

New research from Meta FAIR: Large Concept Models (LCM) is a fundamentally different paradigm for language modeling that decouples reasoning from language representation, inspired by how humans can plan high-level thoughts to communicate.

AI at Meta

531,586 views • 1 year ago

Can we use wearable devices to collect robot data without actual robots? Yes! With a pair of gloves🧤! Introducing DexCap, a portable hand motion capture system that collects 3D data (point cloud + finger motion) for training robots with dexterous hands Everything open-sourced

Chen Wang

234,949 views • 2 years ago

While humans acted as GPT-5’s hands for carrying out the protocols, we also piloted an autonomous robot. It was built to execute arbitrary Gibson cloning protocols from natural language, with human supervision for safety.

Miles Wang

149,348 views • 7 months ago

Happy #NationalRoboticsWeek! Last year, we gave you a sneak peek at AthenaZero, our robotic manipulator built to tackle dynamic tasks like a human arm. Learn more about how this fast, precise robot can switch in an instant from a gentle touch to high force depending on what the task requires:

RAI Institute

23,415 views • 3 months ago