An interactive world model developed by NVIDIA in collaboration with academic partners.
- DreamDojo turns egocentric human video data into physical intelligence.
- Human data is more scalable than robotics data but lacks action labels.
- To solve this, a dedicated action model extracts latent actions by identifying physics... and motion deltas between frames.
Training
- A massive 44k hours of video data are used for pre-training.
- Post-training on small-scale robot datasets maps human physics to specific robot embodiments.
- An additional distillation stage converts the model into an autoregressive, few-step diffusion model, enabling real-time, action-controllable simulation.
Primary Use Cases
- Live Teleoperation: Controlling a robot inside a world simulation in real-time.
- Model-based Planning: Previewing and curating the best actions for improved success.
- Policy Evaluation: Testing robot policies in realistic, out-of-distribution scenarios.
Everything that's open-sourced: weights, code, post-training dataset, eval set, and details to reproduce.

The Humanoid Hub

107,377 subscribers

11,575 views • 3 months ago • via X (Twitter)

Related videos

A humanoid robot policy trained solely on synthetic data generated by a world model. Research Scientist Joel Jang presents NVIDIA's DreamGen pipeline:
⦿ Post-train the world model Cosmos-Predict2 with a small set of real teleoperation demos.
⦿ Prompt the world model to generate synthetic video data with verbs and scenarios not used in the world model’s post-training.
⦿ Auto-label synthetic video data with action sequences.
⦿ Train robot policies using only synthetic data.
That's it. Deploy zero-shot to a real humanoid robot.

The Humanoid Hub

20,968 views • 11 months ago

1/5 🚀 Thrilled to open-source OSCAR 🤖 — an action-conditioned world model for robotics, led by the visiting student in my group Zhuoyuan Wu! It generalizes across different robot embodiments with precise action controllability. All trained on a single GH200 GPU, and outperforms existing open-sourced baselines, which have larger model capacity and need more compute. Everything is public, including training data. 📄 Paper: 🌐 Project: 💻 Code: 🤗 Robot data: 🤗 Human data: 🤗 Weights: #Robotics #WorldModels #AI #OpenSource

Jun Gao

96,151 views • 8 days ago

CMU researchers, in collaboration with NVIDIA, present ASAP, a two-stage framework for humanoid robot agility. It pre-trains motion policies on human data, then refines them with real-world corrections using a delta action model, which adjusts for simulation mismatches.

The Humanoid Hub

1,826,507 views • 1 year ago

CMU researchers, in collaboration with NVIDIA, present ASAP, a two-stage framework for humanoid robot agility. It pre-trains motion policies on human data, then refines them with real-world corrections using a delta action model, which adjusts for simulation mismatches. Early days.

Brian Roemmele

87,593 views • 1 year ago

Tired of teleoperation? One human video → 1,000s of robot demos. (📍GitHub )
Scaling Robot Data Without Dynamics Simulation or Robot Hardware
Real2Render2Real (R2R2R) is a new way to scale robot data without physics simulation or hardware. You take a phone scan + a single monocular human demo. It tracks the motion, renders photorealistic scenes, and generates diverse, robot-agnostic trajectories ready for training.
> No teleop, no sim, no robot, just a phone and a video
> Train VLA models and diffusion policies directly on the output
> Supports multiple robot embodiments with kinematic consistency
> 1000s of demos in 1/27 the time of real-world collection
Thank you, Max Fu, for sharing!!
Project: Paper: Code coming soon:
It shows that with the right pipeline, you can scale robot learning data without touching a robot. One of the most interesting directions in scalable robotics today.
Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

42,804 views • 4 months ago

Cosmos Policy turns a pretrained video diffusion model into a robot controller. Instead of redesigning the architecture, it injects robot state, actions, and values directly as latent frames inside the video model.

Robots Digest 🤖

22,933 views • 4 months ago

Announcing DreamDojo: our open-source, interactive world model that takes robot motor controls and generates the future in pixels. No engine, no meshes, no hand-authored dynamics. It's Simulation 2.0. Time for robotics to take the bitter lesson pill.
Real-world robot learning is bottlenecked by time, wear, safety, and resets. If we want Physical AI to move at pretraining speed, we need a simulator that adapts to pretraining scale with as little human engineering as possible. Our key insights: (1) human egocentric videos are a scalable source of first-person physics; (2) latent actions make them "robot-readable" across different hardware; (3) real-time inference unlocks live teleop, policy eval, and test-time planning *inside* a dream.
We pre-train on 44K hours of human videos: cheap, abundant, and collected with zero robot-in-the-loop. Humans have already explored the combinatorics: we grasp, pour, fold, assemble, fail, retry (across cluttered scenes, shifting viewpoints, changing light, and hour-long task chains) at a scale no robot fleet could match. The missing piece: these videos have no action labels. So we introduce latent actions: a unified representation inferred directly from videos that captures "what changed between world states" without knowing the underlying hardware. This lets us train on any first-person video as if it came with motor commands attached. As a result, DreamDojo generalizes zero-shot to objects and environments never seen in any robot training set, because humans saw them first.
Next, we post-train onto each robot to fit its specific hardware. Think of it as separating "how the world looks and behaves" from "how this particular robot actuates." The base model follows the general physical rules, then "snaps onto" the robot's unique mechanics. It's kind of like loading a new character and scene assets into Unreal Engine, but done through gradient descent and generalizes far beyond the post-training dataset.
A world simulator is only useful if it runs fast enough to close the loop. We train a real-time version of DreamDojo that runs at 10 FPS, stable for over a minute of continuous rollout. This unlocks exciting possibilities:
- Live teleoperation *inside* a dream. Connect a VR controller, stream actions into DreamDojo, and teleop a virtual robot in real time. We demo this on Unitree G1 with a PICO headset and one RTX 5090.
- Policy evaluation. You can benchmark a policy checkpoint in DreamDojo instead of the real world. The simulated success rates strongly correlate with real-world results - accurate enough to rank checkpoints without burning a single motor.
- Model-based planning. Sample multiple action proposals → simulate them all in parallel → pick the best future. Gains +17% real-world success out of the box on a fruit packing task.
We open-source everything!! Weights, code, post-training dataset, eval set, and whitepaper with tons of details to reproduce. DreamDojo is based on NVIDIA Cosmos, which is open-weight too. 2026 is the year of World Models for physical AI. We want you to build with us. Happy scaling! Links in thread:

Jim Fan

208,911 views • 3 months ago
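The model-based planning loop mentioned in the DreamDojo announcement above (sample action proposals, simulate each, keep the best future) can be illustrated with a toy sketch. This is not DreamDojo's code or API: `toy_world_model`, `rollout`, and `plan` are hypothetical names, the "world model" is a one-line stand-in rather than a video model, and proposals are scored one at a time here instead of in parallel.

```python
import random


def toy_world_model(state, action):
    """Hypothetical stand-in for a learned world model: predict the next state."""
    return state + action


def rollout(state, actions):
    """Simulate a sequence of actions and return the final predicted state."""
    for action in actions:
        state = toy_world_model(state, action)
    return state


def plan(state, goal, num_proposals=64, horizon=5, seed=0):
    """Sample action sequences, simulate each one, keep the one ending closest to the goal."""
    rng = random.Random(seed)
    best_actions, best_cost = None, float("inf")
    for _ in range(num_proposals):
        # One proposal is a random sequence of `horizon` actions in [-1, 1].
        actions = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        # Score the imagined future by its distance to the goal state.
        cost = abs(rollout(state, actions) - goal)
        if cost < best_cost:
            best_actions, best_cost = actions, cost
    return best_actions, best_cost


best_actions, best_cost = plan(state=0.0, goal=2.0)
print(f"best cost after planning: {best_cost:.3f}")
```

In a real system the scoring function would presumably come from a task-specific success estimate rather than a distance to a scalar goal; that part is an assumption of this sketch.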

Egocentric Videos for Robot Training 🧵 Researchers at NYU and UC Berkeley published research in which they developed a system called EgoZero that trains robots using human demonstration videos recorded with smart glasses. 👓 The system converts first-person human actions into 3D point-based state-action representations. The policies, executed on a gripper-equipped robot, achieved a 70% zero-shot success rate across seven manipulation tasks, with only 20 minutes of human data per task. EgoZero stands as one of the solid empirical proofs that egocentric smart-glass video collected from everyday human behavior can serve as powerful, scalable training data for real robot learning. Vincent Liu Ademi Adeniji

VaderResearch

23,055 views • 9 months ago

Today, we're joined by Sergey Levine, associate professor at UC Berkeley EECS and co-founder of Physical Intelligence to discuss π0 (pi-zero), a general-purpose robotic foundation model. We dig into the model architecture, which pairs a vision language model (VLM) with a diffusion-based action expert, and the model training "recipe," emphasizing the roles of pre-training and post-training with a diverse mixture of real-world data to ensure robust and intelligent robot learning. We review the data collection approach, which uses human operators and teleoperation rigs, the potential of synthetic data and reinforcement learning in enhancing robotic capabilities, and much more. We also introduce the team’s new FAST tokenizer, which opens the door to a fully Transformer-based model and significant improvements in learning and generalization. Finally, we cover the open-sourcing of π0 and future directions for their research.
🎧 / 🎥 Listen or watch the full episode on our page:
📖 CHAPTERS
00:00 - Introduction
2:14 - Physical Intelligence
3:47 - Key challenges in robotic learning
6:13 - Reinforcement learning in π0 and robotic foundation models
8:36 - π0 VLM model architecture
15:33 - π0 model recipe
18:39 - Pre-training dataset
22:47 - Post-training
24:23 - Laundry folding demo
31:32 - Scaling laws on π0 model
34:57 - FAST
40:26 - Open sourcing π0
43:37 - Other robot types
46:27 - Future directions

The TWIML AI Podcast

19,942 views • 1 year ago

Happy to share what I’ve been working on since joining Genesis! GENE-26.5 is a one-of-a-kind, robotics-native multimodal foundation model that learns from diverse, in-the-wild data across modalities and outputs actions enabling a 54-DoF robot system to perform the most dexterous, long-horizon manipulation tasks to date—approaching human-level capability. This is the result of innovations across the full stack—data collection and processing, robot systems, model architecture, training strategies, and scalable evaluation infrastructure.

Zu Wang

18,402 views • 1 month ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:
- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.
GR00T N1 is a single end-to-end neural net, from photons to actions:
- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.
We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right.
Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

465,704 views • 1 year ago

Real-world robot data is expensive and slow to collect, creating a major challenge for humanoid development. 🤖 The NVIDIA GR00T N1.6 open vision language action model is pre-trained on a diverse mix of data, including thousands of hours of Stanford Vision and Learning Lab’s BEHAVIOR simulation data, which covers long-horizon everyday manipulation tasks. This diverse training is the key to robust cross-embodiment performance and real-world adaptability. 🌍 Read the blog 🔗

NVIDIA Robotics

13,408 views • 4 months ago

Excited to release τ0-WM: an open-source unified video-action world model for robotic manipulation. It's a 5B-parameter robotic foundation model trained on 27.3K hours of real-robot teleoperation, UMI-style demonstrations, and egocentric interaction videos.

Jianlan Luo

53,748 views • 18 days ago

The problem with humanoid teleoperation is that it is expensive and difficult to scale. Enter NVIDIA's EgoScale:
- A VLA model pretrained on thousands of hours of egocentric human videos.
- Mid-trained via 50 hours of human + 4 hours of robot "play" data for human-robot alignment.
- Fine-tuned with very few examples of task-specific robot teleoperation (100 or fewer per task).
- Successfully transfers across 5-finger (Sharpa) and 3-finger (Unitree G1) robot hands.
- Performance scales predictably as data increases.

The Humanoid Hub

44,441 views • 3 months ago

The World Model as NEO's Cognitive Core
1X has revealed a major AI development where the NEO humanoid can translate any natural language prompt into robotic action. It demonstrates this capability even for novel tasks, objects, and environments not found in its robot dataset.
- the 1X World Model is trained on internet-scale human interaction videos and fine-tuned with robot data to ground its understanding in physics and in NEO's embodiment
- from a simple voice or text prompt, the world model generates a visualization of future actions
- a built-in inverse dynamics model then translates these into precise motor movements for NEO

The Humanoid Hub

68,453 views • 5 months ago

NVIDIA just announced EgoScale 🤖🧠
NVIDIA Research has uncovered a log-linear scaling law for robot dexterity by pretraining VLA models on over 20,000 hours of egocentric human video.
This massive dataset is 20 times larger than previous efforts and proves that robot intelligence follows a predictable path: the more human data, the lower the loss.
The secret is a simple recipe combining large-scale human pretraining with a small amount of aligned human-robot mid-training to bridge the gap.
In testing, this method boosted the average success rate by 54% on a 22-DoF robotic hand compared to policies built without pretraining.
EgoScale also enables one-shot task adaptation and works across different hardware, suggesting that human motion is a universal motor prior for robots.
Website: Paper: Source: NVIDIA Research
#Robot #Humanoid #Robotics #AI #EmbodiedAI #PhysicalAI #NVIDIA #EgoScale #GR00T

RoboHub🤖

43,727 views • 3 months ago
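A "log-linear scaling law" as described in the EgoScale post above is usually read as loss falling linearly in the logarithm of the data volume, i.e. loss ≈ a + b·ln(hours) with b < 0. The sketch below shows how such a fit could be computed with ordinary least squares. The data points are synthetic and generated from made-up coefficients purely for illustration; they are not EgoScale results, and `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(hours, losses):
    """Least-squares fit of loss = a + b * ln(hours); returns (a, b)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx              # slope: change in loss per unit of ln(hours)
    a = mean_y - b * mean_x    # intercept
    return a, b


# Synthetic points generated from assumed coefficients a=1.0, b=-0.1 (not real data).
hours = [100, 1000, 10000, 20000]
losses = [1.0 - 0.1 * math.log(h) for h in hours]

a, b = fit_log_linear(hours, losses)
print(f"fitted intercept a={a:.3f}, slope b={b:.3f}")
```

A negative fitted slope corresponds to the claim "the more human data, the lower the loss"; how well a straight line in ln(hours) describes real measurements would have to be checked against the reported data.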

We got a robot to clean up homes that were never seen in its training data! Our new model, π-0.5, aims to tackle open-world generalization. We took our robot into homes that were not in the training data and asked it to clean kitchens and bedrooms. More below⤵️

Physical Intelligence

489,741 views • 1 year ago

Google presents Genie: Generative Interactive Environments. We introduce Genie, the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos. The model can be prompted to generate an endless variety of action-controllable virtual worlds described through text, synthetic images, photographs, and even sketches. At 11B parameters, Genie can be considered a foundation world model. It is comprised of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple and scalable latent action model. Genie enables users to act in the generated environments on a frame-by-frame basis despite training without any ground-truth action labels or other domain-specific requirements typically found in the world model literature. Further, the resulting learned latent action space facilitates training agents to imitate behaviors from unseen videos, opening the path for training generalist agents of the future.

AK

684,281 views • 2 years ago

Sharing a few more demos of our first #LargeBehaviorModel (LBM) at TRI. 1/🍎 This model enables a robot to learn to core and cut an apple into multiple slices autonomously. We trained our diffusion-based LBM on almost 1,700 hours of robot data, conducted 1,800 real-world evaluation rollouts, and ran over 47,000 simulation rollouts to rigorously study its capabilities. Learn more here:

Zubair Irshad

12,202 views • 10 months ago

Testing robot policies on hardware is slow, expensive and hard to scale. World models offer a promising path to accelerating robot policy development. We're sharing new research from the Runway Robotics team, in which we simulated 8 robot policies inside our General World Model and found 0.95 correlation with real-world results. Those early results point to world model simulation as a practical substitute for hardware evaluation, comparing favorably to existing real-to-sim approaches. Learn more at the link below.

Runway

13,464 views • 3 months ago
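The Runway post above reports a correlation between simulated and real-world policy results. A figure like that is typically a Pearson correlation coefficient over per-policy success rates, though the post does not say which statistic was used. The sketch below computes it for four hypothetical policies; the success rates are invented for illustration and are not taken from the Runway study, and `pearson` is a helper written for this example.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


# Hypothetical success rates for four policies (illustrative values only).
simulated = [0.2, 0.4, 0.6, 0.8]
real = [0.25, 0.35, 0.65, 0.75]

r = pearson(simulated, real)
print(f"sim-vs-real correlation: {r:.3f}")

# Ranking policy indices by simulated success gives the order one would test on hardware.
ranking = sorted(range(len(simulated)), key=lambda i: simulated[i], reverse=True)
print("policies ranked by simulated success:", ranking)
```

A high coefficient means the simulated scores preserve the relative ordering and spacing of the real ones, which is what makes ranking checkpoints in simulation useful.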