🎤🎤 Excited to introduce COME-robot🤖🤖, Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V. It is the first closed-loop framework utilizing the vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot demonstrates a significant improvement in task success rate (~25%) compared to SOTA methods. Project: Arxiv:

Siyuan Huang

3,619 subscribers

22,291 views • 2 years ago • via X (Twitter)

6 comments

Siyuan Huang • 2 years ago

(1/4) Given a task instruction, COME-robot employs GPT-4V for reasoning and generates a code-based plan. Through feedback obtained from the robot's execution and interaction with the environment, it iteratively updates the subsequent plan or recovers from failures, ultimately accomplishing the given task.
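The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not COME-robot's actual code: `closed_loop`, `toy_planner`, `ToyRobot`, and `Feedback` are hypothetical names, the planner is a hard-coded stand-in for the GPT-4V call, and the robot is a stub whose first grasp attempt fails.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Feedback:
    """Result of executing one step (hypothetical structure)."""
    success: bool
    message: str


History = List[Tuple[str, Feedback]]


def closed_loop(task: str,
                plan_fn: Callable[[str, History], List[str]],
                execute_fn: Callable[[str], Feedback],
                max_rounds: int = 5) -> Tuple[bool, History]:
    """Plan, execute, and replan from feedback until the planner has nothing left to do."""
    history: History = []
    for _ in range(max_rounds):
        plan = plan_fn(task, history)
        if not plan:
            # The planner reports no remaining steps: the task is done.
            return True, history
        for step in plan:
            feedback = execute_fn(step)
            history.append((step, feedback))
            if not feedback.success:
                # Abandon the rest of this plan and replan with the failure in the history.
                break
    return False, history


def toy_planner(task: str, history: History) -> List[str]:
    """Stand-in for the VLM planner: returns the steps that have not yet succeeded."""
    done = {step for step, feedback in history if feedback.success}
    steps = ["locate cup", "grasp cup", "place cup on shelf"]
    return [step for step in steps if step not in done]


class ToyRobot:
    """Stub robot whose first grasp attempt fails, to exercise the recovery path."""

    def __init__(self) -> None:
        self.grasp_attempts = 0

    def execute(self, step: str) -> Feedback:
        if step == "grasp cup":
            self.grasp_attempts += 1
            if self.grasp_attempts == 1:
                return Feedback(False, "grasp failed: gripper is empty")
        return Feedback(True, f"{step}: ok")


if __name__ == "__main__":
    robot = ToyRobot()
    ok, history = closed_loop("put the cup on the shelf", toy_planner, robot.execute)
    for step, feedback in history:
        print(step, "->", feedback.message)
    print("task completed:", ok)
```

In a real system the planner would also receive images and would return executable code rather than step names; the point here is only the control flow, where every execution result is fed back before the next plan is generated.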

Siyuan Huang • 2 years ago

(2/4) The unique properties of COME-robot: Active Perception, Situated Commonsense Reasoning, and Recovery from Failure.

Siyuan Huang • 2 years ago

(3/4) Some trials of mobile and tabletop manipulation, including some that recover from failures. The objects on the table are randomly rearranged after each trial.

Siyuan Huang • 2 years ago

(4/4) The VLM can provide helpful feedback for visual feedback errors, grasp failures, wrong detections, etc. The following are some examples.

jack • 2 years ago

Cool, did you build the robot yourself or buy it?

Markus Heimerl • 2 years ago

I think this is the direction we are heading. Foundation models are incredible generalizers, and it does not make sense to train a robotic perception model or develop a planning algorithm yourself if a visual foundation model is one API call away. Nice work!

Related videos

Excited to release OK-Robot, an open-vocabulary mobile-manipulator for homes. Simply tell the robot what to pick and where to drop it in natural language, and it will do it. Like: Me: "OK Robot, move the Takis from the desk to the nightstand" Robot: ⬇️

Lerrel Pinto

152,251 views • 2 years ago

The future of robot butlers starts with mobile manipulation. We’re announcing the NeurIPS 2023 Open-Vocabulary Mobile Manipulation Challenge! - Full robot stack ✅ - Parallel sim and real evaluation ✅ - No robot required ✅👀

Chris Paxton

178,966 views • 3 years ago

NVIDIA Isaac GR00T N1 is an open generalist foundation model for #humanoidrobots. 🤖 Discover how the model can easily generalize common tasks, boosting robot reasoning and skills. ▶️

NVIDIA Robotics

53,401 views • 1 year ago

Gen2Act: Casting language-conditioned manipulation as *human video generation* followed by *closed-loop policy execution conditioned on the generated video* enables solving diverse real-world tasks unseen in the robot dataset! 1/n

Homanga Bharadhwaj

71,143 views • 1 year ago

🤖What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use a pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,471 views • 8 months ago

Qwen-VLA feels like one of the first real robotics foundation models. A single system trained across robot manipulation, navigation, egocentric human video, simulation, and vision-language reasoning instead of isolated robot policies.

Robots Digest 🤖

14,663 views • 1 month ago

CMU Vision-Language-Autonomy update: The team just released SORT3D, the first general spatial relation toolbox for autonomous vision-language navigation that is fully integrated into real-robot systems! 🤖👀 Simulation and real-robot data is provided!:

CMU Robotics Institute

18,234 views • 1 year ago

Announcing the first production robot navigation framework on $500 hardware. Explore the world once → your robot agent will relocalize and build a persistent spatial memory across sessions. SLAM, relocalization, loop closure, map i/o, planning, control. No ROS. Open source.

stash

53,361 views • 3 days ago

🤖THIS VACUUM ROBOT IS A REAL “GOOD BOY” Designed to pick up cigarette butts, the four-legged robot uses vacuum-powered feet to collect litter with every step. In testing, it achieved a 90% success rate.

Coin Bureau

24,807 views • 17 days ago

LEGO-SLAM: Language-Embedded Gaussian Optimization SLAM LEGO-SLAM running at 15 FPS on a ScanNet scene with language-based loop closing for drift correction. LEGO-SLAM is a 3DGS-based SLAM framework that supports open-vocabulary semantic querying and rendering. It tracks via G-ICP and efficiently builds a map by embedding Gaussians with scene-adaptive 16D language features. Map management is achieved through Language Pruning and Language-Based Loop Detection. The generated map enables open-vocabulary 3D Object Localization.

Ryohei Sasaki@engineer

14,935 views • 3 months ago

Real-world robot data is expensive and slow to collect, creating a major challenge for humanoid development. 🤖 The NVIDIA GR00T N1.6 open vision language action model is pre-trained on a diverse mix of data, including thousands of hours of Stanford Vision and Learning Lab’s BEHAVIOR simulation data, which covers long-horizon everyday manipulation tasks. This diverse training is the key to robust cross-embodiment performance and real-world adaptability. 🌍 Read the blog 🔗

NVIDIA Robotics

13,429 views • 5 months ago

LMDrive Closed-Loop End-to-End Driving with LLM An end-to-end, closed-loop, language-based autonomous driving framework, which interacts with the dynamic environment via multi-modal multi-view sensor data and natural language instructions

rsasaki0109

10,085 views • 2 years ago

🚨🚨Can agents earn money, run a business, or even self-organize a society in the physical social world? 🤖🤖 Can agents learn continually to survive and thrive in embodied environments, like how human babies grow? 👶 Super excited to introduce SimWorld, an open-ended simulator of LLM agents in infinite, realistic embodied worlds. SimWorld features 3 key designs:

1⃣Open-ended realistic world simulation
- built on Unreal Engine 5, with accurate physical social dynamics
- 100+ built-in environments (city, island, wilderness ...)
- language-controllable procedural generation
- text-to-3D asset generation

2⃣Native interface for LLM/VLM agents
- Gym-like agent-environment interaction APIs
- plug in any LLMs/VLMs (GPTs, Gemini, Qwen ...)
- rich multi-modal perception
- open-vocabulary natural-language action outputs

3⃣Diverse physical and social reasoning scenarios
- long-horizon embodied reasoning
- multi-agent collaboration / competition
- easily customizable for any reasoning tasks

SimWorld is fully open-sourced, with a hope to become a foundational infrastructure for real-world agent research across disciplines: robotics, economy, public health, education, etc. Project website + more details in the thread👇 ...1/

Lianhui Qin

64,864 views • 7 months ago

State-of-the-art robot policies often need hundreds of hours of data. What if we needed none? Introducing TiPToP: a manipulation system that zero-shots open-world tasks from pixels and language using vision foundation models and GPU-parallelized Task and Motion Planning (TAMP).

Nishanth Kumar

77,488 views • 3 months ago

Introduce Open-𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧🤖: ⁣ We need an intuitive and remote teleoperation interface to collect more robot data. 𝐓𝐞𝐥𝐞𝐕𝐢𝐬𝐢𝐨𝐧 lets you immersively operate a robot even if you are 3000 miles away, like in the movie 𝘈𝘷𝘢𝘵𝘢𝘳. Open-sourced!

Xuxin Cheng

329,449 views • 2 years ago

Excited to release τ0-WM: an open-source unified video-action world model for robotic manipulation. It's a 5B-parameter robotic foundation model trained on 27.3K hours of real-robot teleoperation, UMI-style demonstrations, and egocentric interaction videos.

Jianlan Luo

54,352 views • 1 month ago

The next frontier of autonomous driving is unlocked by reasoning models. NVIDIA Alpamayo brings together open AI models with reasoning capabilities, closed-loop simulation tools, and massive real-world driving datasets. Alpamayo 1 is a vision–language–action model that explains its own decisions through explicit reasoning traces, enabling trustworthy, humanlike decision-making. Together with NVIDIA’s Physical AI dataset and AlpaSim simulation, Alpamayo provides the tools and scale required to enable level 4 autonomous vehicles. ▶️ Watch now:

NVIDIA DRIVE

35,324 views • 5 months ago

What happens when we train the largest vision-language model and add in robot experiences? The result is PaLM-E 🌴🤖, a 562-billion parameter, general-purpose, embodied visual-language generalist - across robotics, vision, and language. Website:

Danny Driess

1,272,497 views • 3 years ago

Chinese humanoid robotics company LimX Dynamics has unveiled COSA (Cognitive Operating System of Agents). COSA is described as a unified "brain-body" architecture that allows the robot to think and act simultaneously in the real world. It integrates:
- high-level cognition (reasoning, planning, adaptation)
- and whole-body motion control (low-latency dynamic locomotion, manipulation)

It enables the Oli robot to act on natural language instructions and perform tasks that involve both walking and manipulation while being adaptive to interruptions.

The Humanoid Hub

71,132 views • 5 months ago

Defining the Future of Motion: BFM-2 Foundation Model & The Era of Robot "Muscle Memory" AGIBOT introduces BFM-2, a two-stage Motion-Between locomotion foundation model. It delivers autonomous, stable interpolation and closed-loop dynamic task execution across any state—static, preset, or random. The ultimate robust foundation for Embodied Intelligence. #AGIBOT #Humanoid #BFM #EmbodiedAI #PhysicalAI

AGIBOT

18,521 views • 1 month ago