New paper introduces NaVILA, a vision-language-action (VLA) model that integrates high-level visual-language understanding and low-level locomotion control. It enables humanoid or quadruped robots to navigate unseen environments with natural language instructions.

The Humanoid Hub

110,143 subscribers

20,458 views • 1 year ago • via X (Twitter)

Science, Technology & Education

4 comments

The Humanoid Hub • 1 year ago

Detailed insights in this thread:

AssemblyAI • 1 year ago

Announcing: Our most advanced speech-to-text model goes beyond accuracy to capture the real-world complexity of human conversation and deliver reliable, source-of-truth audio data. Explore Universal-2 updates 👇

AI Expert Khalid • 1 year ago

I'm fascinated by the potential of NaVILA to revolutionize robot navigation! Imagine instructing robots with just natural language. Pure adrenaline!

maru • 1 year ago

It sounds like

Related videos

LeVERB is a VLA framework for humanoid whole-body control, combining a vision-language model and a low-level controller via a shared latent action space, trained entirely in sim, deployed zero shot.

The Humanoid Hub

10,233 views • 1 year ago

🔥 Character AI in VR Space 🔥 We present #SOLAMI, a social vision-language-action (VLA) model that enables 3D autonomous characters to interact through speech and body language. Thanks AK!

Ziwei Liu

27,957 views • 1 year ago

UK-based startup 'Humanoid' announced KinetIQ, an AI framework with a Vision-Language-Action (VLA) model at its core. It uses a four-layer architecture: fleet orchestration, task decomposition, VLA, and RL for whole-body control. It works on both bipedal and wheeled robots.

The Humanoid Hub

21,954 views • 5 months ago

🚀🚀🚀 Ever wondered what it takes for robots to handle real-world household tasks? Long-horizon execution, deformable object dexterity, and unseen object generalization: meet GR-3, ByteDance Seed’s new Vision-Language-Action (VLA) model! GR-3 is a generalizable VLA model with strong capabilities in complex long-horizon tasks. It understands unseen abstract concepts, manipulates deformable objects robustly, and adapts to novel settings with minimal human data.
✨ Generalization: Generalizes well to unseen objects, environments, and even instructions with abstract concepts.
✨ Long-Horizon Manipulation: Completes long-horizon tasks with strong instruction-following capabilities.
✨ Deformable Object Manipulation: Manipulates deformable objects robustly.
#ByteDance #ByteDanceSeed #GR3 #VLA #Robotics #FoundationModels

Xiao Ma

46,323 views • 11 months ago

Chinese humanoid robotics company LimX Dynamics has unveiled COSA (Cognitive Operating System of Agents). COSA is described as a unified "brain-body" architecture that allows the robot to think and act simultaneously in the real world. It integrates:
- high-level cognition (reasoning, planning, adaptation)
- whole-body motion control (low-latency dynamic locomotion, manipulation)

It enables the Oli robot to act on natural language instructions and perform tasks that involve both walking and manipulation while adapting to interruptions.

The Humanoid Hub

71,132 views • 5 months ago

Big news from the TeleAI embodied intelligence team! They unveiled TextOp, a general cerebellar framework for real-time, text-driven humanoid control. TextOp lets humans control the robot via natural language, dynamically modifying instructions during runtime to instantly generate smooth, whole-body actions. This enables precise and highly versatile control. The framework uses a two-layer architecture: an action diffusion model (High-Level) and a general motion tracking policy (Low-Level). It’s a new paradigm that eliminates the need for pre-recorded scripts or manual programming.

RoboHub🤖

23,804 views • 7 months ago

Another day, another humanoid robot from China. AGIBOT introduces GO-1, a generalist foundation model that integrates a vision-language model with a latent planner for enhanced long-horizon and dexterous manipulation.

Chubby♨️

23,513 views • 1 year ago

NEWS: Humanoid robot startup Figure announced Helix today, their "in-house AI that reasons like a human." "Our robots equipped with Helix can now pick up virtually any household object without any code or prior training. Helix uses a single set of neural network weights to learn all behaviors." "We're introducing Helix, a generalist Vision-Language-Action (VLA) model that unifies perception, language understanding, and learned control to overcome multiple longstanding challenges in robotics."

Sawyer Merritt

181,177 views • 1 year ago

3D-VLA: A 3D Vision-Language-Action Generative World Model

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

85,570 views • 2 years ago

1/N Most Vision-Language-Action models need tons of data for finetuning, and still fail for new objects and instructions. Introducing OTTER, a lightweight, easy-to-train model that uses text-aware visual features to nail unseen tasks out of the box! Here's how it works 👇

Fangchen Liu

68,312 views • 1 year ago

Today, we announced 𝗥𝗧-𝟮: a first of its kind vision-language-action model to control robots. 🤖 It learns from both web and robotics data and translates this knowledge into generalised instructions. Find out more:

Google DeepMind

537,808 views • 2 years ago

We’re bringing powerful AI directly onto robots with Gemini Robotics On-Device. 🤖 It’s our first vision-language-action model to help make robots faster, highly efficient, and adaptable to new tasks and environments - without needing a constant internet connection. 🧵

Google DeepMind

819,309 views • 1 year ago

Chain-of-thought reasoning is a powerful tool to enable language models to work through complex problems. Can we use this with robots? With embodied chain-of-thought, vision-language-action (VLA) models can think through perception and planning! A 🧵👇

Sergey Levine

30,388 views • 2 years ago

Day 2 of 3 MLX Releases: Introducing Local Computer-Use 🚀🔥 A powerful tool built with MLX that uses Vision Language models and Voice models to control your Mac through visual understanding, planning and reasoning. Features ⚡️Automate your workflow with natural language 😎 Control your computer “hands-free” This project now supports both: 🤖 Level 1 (GUI Agent) 🧠 Level 2 (Autonomous GUI Agent) Get started: > pip install -U mlx-vlm mlx-audio mlx-whisper Please leave us a star and send a PR :)

Prince Canuma

45,867 views • 1 year ago

Tired of your vision-language-action (VLA) model failing catastrophically in the presence of distractions? Check out BYOVLA (Bring Your Own VLA), a run-time intervention scheme that markedly improves performance with distractor objects and backgrounds.

Anirudha Majumdar

17,831 views • 1 year ago

VLAExplain — Interpreting Vision-Language-Action (VLA) Models VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

Ryohei Sasaki@engineer

12,774 views • 2 months ago

The next evolution: VLA+ models

Just yesterday Microsoft Research released Rho-alpha (ρα), their first robotics model, built on the Phi family. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds:
▪️ Tactile sensing to feel objects during manipulation
▪️ Online learning that lets it improve from human corrections (via teleoperation, 3D mouse or other tools) in real time, even after deployment

Both of these make adaptability central rather than incidental. Microsoft calls it a VLA+ model, positioning it as an extension beyond what current VLA systems support.

➡️ Today Rho-alpha can control dual-arm robot setups to perform tasks such as:
• Manipulating the BusyBox following natural-language instructions
• Plug insertion
• Toolbox packing and object arrangement with bimanual coordination

But to understand why this "plus" matters, we need to understand what came before. Here, we'll take you through the entire landscape of VLA models, including Gemini Robotics, π0, SmolVLA, Helix, ACoT-VLA and others:

Turing Post

62,362 views • 5 months ago

Google DeepMind introduced two foundational models for embodied reasoning, enabling robots to comprehend, react, and take action in the physical world:
⦿ Gemini Robotics: built on Gemini 2.0. Integrates vision, language, and action for real-world dexterity.
⦿ Gemini Robotics-ER: enhances spatial reasoning for advanced robotic control.

They are working with Apptronik to develop the next generation of humanoid robots.

The Humanoid Hub

73,097 views • 1 year ago