3D-VLA: A 3D Vision-Language-Action Generative World Model

Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

508,271 subscribers

85,570 views • 2 years ago • via X (Twitter)

Science & Technology · Health & Wellness · Education


6 comments

AK · 2 years ago

perception to action, neglecting the vast dynamics of the world and the relations between actions and dynamics. In contrast, human beings are endowed with world models that depict imagination about future scenarios to plan actions accordingly. To this end, we propose

AK · 2 years ago

3D-VLA by introducing a new family of embodied foundation models that seamlessly link 3D perception, reasoning, and action through a generative world model. Specifically, 3D-VLA is built on top of a 3D-based large language model (LLM), and a set of interaction tokens is

AK · 2 years ago

introduced to engage with the embodied environment. Furthermore, to inject generation abilities into the model, we train a series of embodied diffusion models and align them into the LLM for predicting the goal images and point clouds. To train our 3D-VLA, we curate a

AK · 2 years ago

large-scale 3D embodied instruction dataset by extracting vast 3D-related information from existing robotics datasets. Our experiments on held-in datasets demonstrate that 3D-VLA

AK · 2 years ago

significantly improves the reasoning, multimodal generation, and planning capabilities in embodied environments, showcasing its potential in real-world applications.

AK · 2 years ago

paper page:

Related videos

VLAExplain — Interpreting Vision-Language-Action (VLA) Models VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

Ryohei Sasaki@engineer

12,774 views • 2 months ago

Tired of your vision-language-action (VLA) model failing catastrophically in the presence of distractions? Check out BYOVLA: Bring Your Own VLA: a run-time intervention scheme that markedly improves performance with distractor objects and backgrounds.

Anirudha Majumdar

17,831 views • 1 year ago

Large language models reason through text. Vision‑language‑action models reason through the real world. By fusing perception, context, and action from live video, VLAs deliver the awareness physical AI needs for next‑gen robotics and edge systems.

Intel

15,931 views • 5 months ago

Introducing ♾OmniJARVIS, our latest venture to #AgentGPT, or vision-language-action (VLA) models for open-world instruction-following agents 🦾🕹️ tuning in 👉 by Team CraftJarvis, 🤿⏬

Xiaojian Ma

32,550 views • 2 years ago

Pi0 from Physical Intelligence is one of the best generalist Vision-Language-Action (VLA) models for robotics (and predecessor of Pi0.5). 🧵 Here's how it works + how you can fine-tune it on your own robot for a simple task:

Ilia

32,956 views • 10 months ago

3D-LLM: Injecting the 3D World into Large Language Models

paper page:

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness

Recent advancements in Large Multimodal Models (LMMs) have greatly enhanced their proficiency in 2D visual understanding tasks, enabling them to effectively process and understand images and videos. However, the development of LMMs with 3D-awareness for 3D scene understanding has been hindered by the lack of large-scale 3D vision-language datasets and powerful 3D encoders. In this paper, we introduce a simple yet effective framework called LLaVA-3D. Leveraging the strong 2D understanding priors from LLaVA, our LLaVA-3D efficiently adapts LLaVA for 3D scene understanding without compromising 2D understanding capabilities. To achieve this, we employ a simple yet effective representation, 3D Patch, which connects 2D CLIP patch features with their corresponding positions in 3D space. By integrating the 3D Patches into 2D LMMs and employing joint 2D and 3D vision-language instruction tuning, we establish a unified architecture for both 2D image understanding and 3D scene understanding. Experimental results show that LLaVA-3D converges 3.5x faster than existing 3D LMMs when trained on 3D vision-language datasets. Moreover, LLaVA-3D not only achieves state-of-the-art performance across various 3D tasks but also maintains comparable 2D image understanding and vision-language conversation capabilities with LLaVA.

AK

41,736 views • 1 year ago

The next evolution: VLA+ models

Just yesterday Microsoft Research released Rho-alpha (ρα) – their first robotics model, built on the Phi family. While most Vision-Language-Action (VLA) models stop at vision and language, Rho-alpha adds:
▪️ Tactile sensing to feel objects during manipulation
▪️ Online learning that lets it improve from human corrections (via teleoperation, 3D mouse or other tools) in real-time even after deployment.

Both these sides make adaptability central rather than incidental. Microsoft calls it a VLA+ model, positioning it as an extension beyond what current VLA systems support.

➡️ Today Rho-alpha can control dual-arm robot setups to perform tasks such as:
• Manipulating the BusyBox following natural-language instructions
• Plug insertion
• Toolbox packing and object arrangement with bimanual coordination

But to understand why this "plus" matters, we need to understand what came before. Here, we'll take you through the entire landscape of VLA models – Gemini Robotics, π0, SmolVLA, Helix, ACoT-VLA and others:

Turing Post

62,362 views • 6 months ago

Microsoft just dropped VITRA-VLA, a new Vision-Language-Action model for robotics on Hugging Face. It learns dexterous manipulation from over 1 million real-life human hand activity videos.

DailyPapers

19,177 views • 7 months ago

Chain-of-thought reasoning is a powerful tool to enable language models to work through complex problems. Can we use this with robots? With embodied chain-of-thought, vision-language-action (VLA) models can think through perception and planning! A 🧵👇

Sergey Levine

30,388 views • 2 years ago

New paper introduces NaVILA, a vision-language-action (VLA) model that integrates high-level visual-language understanding and low-level locomotion control. It enables humanoid or quadruped robots to navigate unseen environments with natural language instructions.

The Humanoid Hub

20,462 views • 1 year ago

What if robots could improve themselves by learning from their own failures in the real-world? Introducing 𝗣𝗟𝗗 (𝗣𝗿𝗼𝗯𝗲, 𝗟𝗲𝗮𝗿𝗻, 𝗗𝗶𝘀𝘁𝗶𝗹𝗹) — a recipe that enables Vision-Language-Action (VLA) models to self-improve for high-precision manipulation tasks. PLD couples real-world residual reinforcement learning with standard supervised fine-tuning — letting robots discover, recover, and distill their own data flywheel. Quick 🧵

Wenli Xiao

185,017 views • 8 months ago

Robot movements are getting better at a scary pace. Huge real-world datasets and VLA models now turn vision-language into smooth continuous control, diffusion policies generate coherent action sequences, and compliant sensing-actuation reduces jitter.

Rohan Paul

124,385 views • 7 months ago

UK-based startup 'Humanoid' announced KinetIQ, an AI framework with a Vision-Language-Action (VLA) model at its core. It uses a four-layer architecture: fleet orchestration, task decomposition, VLA, and RL for whole-body control. It works on both bipedal and wheeled robots.

The Humanoid Hub

21,954 views • 5 months ago

Can we synthesize 3D human-scene interactions without learning from any 3D data? Yes! Check out Lei Li's GenZI, a novel zero-shot approach to generating 3D interactions by distilling priors from large vision-language models.

Angela Dai

106,862 views • 2 years ago

Excited to introduce 𝐋𝐀𝐏𝐀: the first unsupervised pretraining method for Vision-Language-Action models.
Outperforms SOTA models trained with ground-truth actions
30x more efficient than conventional VLA pretraining
📝:
🧵 1/9

Joel Jang

46,018 views • 1 year ago

Ever wondered what robots 🤖 could achieve if they could not just see – but also feel and hear? Introducing FuSe: a recipe for finetuning large vision-language-action (VLA) models with heterogeneous sensory data, such as vision, touch, sound, and more. Details in the thread 👇

Carlo Sferrazza

46,111 views • 1 year ago

LeVERB is a VLA framework for humanoid whole-body control, combining a vision-language model and a low-level controller via a shared latent action space, trained entirely in sim, deployed zero shot.

The Humanoid Hub

10,233 views • 1 year ago

Exciting progress on Vision-Language-Action models from a collaboration between San Francisco-based Physical Intelligence (π) and China’s AGIBOT: (π)’s single model can autonomously perform diverse tasks on the AGIBOT G1 robot, using both humanoid hands and two-finger grippers.

The Humanoid Hub

32,575 views • 1 year ago

🚀 MiniCPM enters the physical world, enabling robots to understand, remember, and act. We open-source MiniCPM-Robot, our first embodied AI model series, including:
🤖 MiniCPM-RobotManip: a 1.5B general-purpose Vision-Language-Action (VLA) model for robotic manipulation.
🐕 MiniCPM-RobotTrack: a compact model for real-world target tracking.
⚡ PhyAI: a high-performance inference framework built for embodied models.
Together, they bring efficient, practical, and open embodied intelligence closer to real-world robots.
⭐ GitHub:
🤗 MiniCPM-RobotManip:
🤗 MiniCPM-RobotTrack:

OpenBMB

317,092 views • 3 days ago