Large language models reason through text. Vision‑language‑action models reason through the real world. By fusing perception, context, and action from live video, VLAs deliver the awareness physical AI needs for next‑gen robotics and edge systems.

Intel

4,473,410 subscribers

15,931 views • 5 months ago • via X (Twitter)

Related Videos

3D-VLA: A 3D Vision-Language-Action Generative World Model. Recent vision-language-action (VLA) models rely on 2D inputs, lacking integration with the broader realm of the 3D physical world. Furthermore, they perform action prediction by learning a direct mapping from

AK

85,570 views • 2 years ago

Google DeepMind introduced two foundational models for embodied reasoning, enabling robots to comprehend, react, and take action in the physical world: ⦿ Gemini Robotics – Built on Gemini 2.0. Integrates vision, language, and action for real-world dexterity. ⦿ Gemini Robotics-ER – Enhances spatial reasoning for advanced robotic control. They are working with Apptronik to develop the next generation of humanoid robots.

The Humanoid Hub

73,097 views • 1 year ago

Chain-of-thought reasoning is a powerful tool to enable language models to work through complex problems. Can we use this with robots? With embodied chain-of-thought, vision-language-action (VLA) models can think through perception and planning! A 🧵👇

Sergey Levine

30,388 views • 2 years ago

Why LLMs are a dead end for human-level intelligence, and especially for Physical AI / Robotics. The next leap isn’t bigger language models. It’s World Models. I just dropped a full 1-hour presentation from Shanghai: “World Models: the ChatGPT moment for robotics?” → Why LLMs hit a wall → Why action-conditioned world models planning in latent space are the real path → Live World Forge demo with LeWorldModel + Hugging Face LeRobot Watch here. The future of intelligence is embodied, not just chatty.

abdel

37,349 views • 1 month ago

Not the flashiest demos, but what’s under the hood represents a foundational shift for general-purpose robotics. World models are the next-gen foundation of Physical AI, not the VLM backbones found in typical VLAs. DreamZero is a 14B-parameter World Action Model (WAM) by NVIDIA that treats robotics as a joint video-and-action prediction task. Unlike traditional Vision-Language-Action (VLA) models that map images directly to motor commands, DreamZero leverages a pretrained video diffusion backbone to predict future world states and actions simultaneously. DreamZero:
- achieves 2× better zero-shot generalization to unseen tasks and environments compared to state-of-the-art VLAs.
- learns effectively from heterogeneous, non-repetitive data (500 hours), breaking the need for thousands of repeated demonstrations.
- adapts to new robot embodiments with just 30 minutes of play data.
- enables 7Hz closed-loop control via system optimizations and "DreamZero-Flash," making high-capacity diffusion models viable for real-time use.

The Humanoid Hub

35,204 views • 5 months ago

Jensen Huang explains that AI has progressed through four major phases. The first wave was Perception AI, where deep learning enabled superhuman vision and speech recognition. The second wave was Generative AI, allowing AI to create text, images, video, and more. We are now in the third wave, Reasoning AI, where models can think through problems, apply logic, and even conduct research. The fourth wave will be Physical AI, where models gain real-world understanding and common sense, leading to advanced robotics.

Wes Roth

69,367 views • 1 year ago

VLAExplain — Interpreting Vision-Language-Action (VLA) Models VLAExplain is an interpretability toolkit designed to help users visually understand the inner workings of Vision-Language-Action (VLA) models. Currently, attention analysis is supported for both the pi05 and unifolm-vla models. For details, please check pi05 and UnifoLM-VLA readme files respectively. Demo of pi05 in action:

Ryohei Sasaki@engineer

12,774 views • 2 months ago

Pi0 from Physical Intelligence is one of the best generalist Vision-Language-Action (VLA) models for robotics (and predecessor of Pi0.5). 🧵 Here's how it works + how you can fine-tune it on your own robot for a simple task:

Ilia

32,956 views • 10 months ago

Meet Reka Edge – Our next-generation vision language model for physical AI. Uses 3x fewer input tokens and achieves 65% faster throughput compared to leading 8B models. Image understanding, video analysis, object detection, and tool use. Built for Action. Fast enough for production, deployable anywhere. Read more:

Reka

53,785 views • 4 months ago

The next frontier of autonomous driving is unlocked by reasoning models. NVIDIA Alpamayo brings together open AI models with reasoning capabilities, closed-loop simulation tools, and massive real-world driving datasets. Alpamayo 1 is a vision–language–action model that explains its own decisions through explicit reasoning traces, enabling trustworthy, humanlike decision-making. Together with NVIDIA’s Physical AI dataset and AlpaSim simulation, Alpamayo provides the tools and scale required to enable level 4 autonomous vehicles. ▶️ Watch now:

NVIDIA DRIVE

35,324 views • 6 months ago

Yann LeCun says the real world is far more complex than the world of language. LLMs can accumulate knowledge, but they fail with high-dimensional, continuous, noisy sensory data. "The next revolution is physical AI": systems that can truly plan, reason, and understand the physical environment.

Haider.

66,616 views • 5 months ago

Check out this Circulus Pion robot at Computex 2026. It’s running on a Panther Lake processor, using P-cores for real-time control, E-cores to collect sensor data, the GPU for vision-language-action models, and the NPU for continuous perception.

Intel Technology

10,752 views • 1 month ago

You can't control the uncontrollables. But that doesn't mean you can't control the future of robotics. Teleop is live on PrismaX. You can now train vision-language-action models and earn PrismaX points for your impact.

PrismaX

23,438 views • 11 months ago

Jensen Huang says the industry is shifting from generative AI to agentic AI. The next major step is fusing public cloud frontier models with customized open-source systems running on enterprise servers. But the end goal is physical: "moving intelligence from the cloud into industrial AI, factories and robotics."

Haider.

49,411 views • 6 months ago

Yann LeCun says language isn’t intelligence. Predicting text doesn’t mean understanding reality. The real world is messy, physical, and causal, and today’s LLMs barely touch that. The next leap is Physical AI: world models, cause and effect, real planning. Do you think LLMs can evolve into this, or do we need a completely new architecture?

VraserX e/acc

76,178 views • 5 months ago

DynamicVLA: A compact 0.4B Vision-Language-Action model that finally lets robots manipulate moving objects in real-time, closing the perception-execution gap with Continuous Inference and Latent-aware Action Streaming.

DailyPapers

16,357 views • 5 months ago

Computer vision can see what happened — agentic AI explains why it matters and what to do next. Here’s how teams are upgrading video analytics with vision language models: 🔍 Turn video into searchable intelligence 🧠 Add context and reasoning to system alerts ⚡ Summarize complex scenes and answer questions automatically Read the blog to see 3 real-world ways to bring agentic AI to computer vision →

NVIDIA AI

14,273 views • 6 months ago

NEWS: NVIDIA just announced Alpamayo, what CEO Jensen Huang calls the world’s first thinking, reasoning autonomous vehicle AI, launching on U.S. roads later this year, starting with the Mercedes CLA. Jensen: "It's trained end-to-end. Literally from camera in to actuation out. It reasons what action it is about to take, the reason by which it came about that action, and the trajectory." Alpamayo introduces Vision-Language-Action (VLA) models, which enable self-driving systems to interpret what they see, reason about complex driving scenarios, and generate driving actions. The platform includes large reasoning models, simulation tools for testing rare and edge-case scenarios, and open datasets for training and validation. NVIDIA says the approach improves transparency, safety, and robustness in autonomous systems, particularly in complex real-world environments, and supports progress toward higher levels of vehicle autonomy: "With a 10-billion-parameter architecture, Alpamayo 1 uses video input to generate trajectories alongside reasoning traces, showing the logic behind each decision. Developers can adapt Alpamayo 1 into smaller runtime models for vehicle development, or use it as a foundation for AV development tools such as reasoning-based evaluators and auto-labeling systems. Alpamayo 1 provides open model weights and open-source inferencing scripts. Future models in the family will feature larger parameter counts, more detailed reasoning capabilities, more input and output flexibility, and options for commercial usage."

Sawyer Merritt

1,603,561 views • 6 months ago