🎤🎤 Excited to introduce COME-robot🤖🤖, Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V. It is the first closed-loop framework that uses a vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot demonstrates a significant improvement in task success rate (~25%) compared to SOTA methods. Project: Arxiv:
22,291 views • 2 years ago • via X (Twitter)
6 comments

(1/4) Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Through feedback obtained from the robot's execution and interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
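
The plan-execute-feedback loop described above can be sketched roughly as follows. This is a minimal illustration, not the COME-robot implementation: `run_closed_loop`, `toy_planner`, `make_toy_executor`, and `Feedback` are hypothetical names, the planner is a hard-coded stand-in for the GPT-4V call, and plain strings stand in for the code-based plan.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    """Result of executing one step (hypothetical structure)."""
    success: bool
    message: str


History = List[Tuple[str, Feedback]]


def run_closed_loop(
    instruction: str,
    plan_fn: Callable[[str, History], Optional[str]],
    execute_fn: Callable[[str], Feedback],
    max_rounds: int = 10,
) -> Tuple[bool, History]:
    """Ask the planner for the next step, execute it, and feed the result back."""
    history: History = []
    for _ in range(max_rounds):
        step = plan_fn(instruction, history)
        if step is None:  # planner reports that the task is complete
            return True, history
        history.append((step, execute_fn(step)))
    return False, history  # round budget exhausted before completion


def toy_planner(instruction: str, history: History) -> Optional[str]:
    """Stand-in for the VLM: return the first step that has not yet succeeded."""
    steps = ["detect cup", "grasp cup", "place cup on tray"]
    done = {step for step, fb in history if fb.success}
    remaining = [s for s in steps if s not in done]
    return remaining[0] if remaining else None


def make_toy_executor() -> Callable[[str], Feedback]:
    """Stand-in for the robot: the first grasp attempt fails, later ones succeed."""
    attempts = {}

    def execute(step: str) -> Feedback:
        attempts[step] = attempts.get(step, 0) + 1
        if step.startswith("grasp") and attempts[step] == 1:
            return Feedback(False, "grasp failed: gripper is empty")
        return Feedback(True, "ok")

    return execute


ok, history = run_closed_loop(
    "put the cup on the tray", toy_planner, make_toy_executor()
)
for step, fb in history:
    print(f"{step}: {fb.message}")
```

In this toy run the failed grasp shows up in the history, so the planner issues the same step again before moving on. That retry is the recovery behaviour the thread describes, reduced to its simplest form.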

(2/4) The unique properties of COME-robot: Active Perception, Situated Commonsense Reasoning, and Recover from Failure.

(3/4) Some trials of mobile and tabletop manipulation, including ones that recover from failures. The objects on the table are randomly rearranged after each trial.

(4/4) The VLM can provide helpful feedback on visual feedback errors, grasp failures, wrong detections, etc. Here are some examples:

Cool, did you build the robot yourself or buy it?

I think this is the direction we are heading in. Foundation models are incredible generalizers, and it does not make sense to train a robotic perception model or develop a planning algorithm yourself if a visual foundation model is one API call away. Nice work!


