Robots might learn better from video than from language! 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic-video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions. The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem. Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! Here's the project page: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

58,003 subscribers

50,063 views • 7 months ago • via X (Twitter)

Related videos

Turning video into humanoid robot motion! 🤳🏼 Training humanoid robots needs huge amounts of motion data, but real-world capture doesn’t scale. Mocap is expensive, dangerous edge cases are rare, and you can’t ask humans to repeatedly fall or crash. Video2Robot tackles this by converting videos into physics-grounded humanoid simulations. Motion is generated to respect balance, inertia, ground contact, and joint limits, then directly retargeted to robot simulators. One prompt can generate a full humanoid motion sequence, including multi-agent interactions and failure cases like falls or collisions, scenarios that are hard or impossible to capture safely in the real world. The pipeline is model-agnostic and works with existing video generators, making it a practical way to scale data for robots. If robots are going to operate in the real world, they need to be trained on the failures too, not just the perfect demos. Here's the GitHub: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

59,708 views • 7 months ago

Introducing FLUX-mimic, a next-generation Video-Action Model for general purpose dexterity, developed in partnership with Black Forest Labs. Late last year we published mimic-video and introduced Video-Action Models (VAM): a new family of robotics foundation models built on top of video generation models. We showed that robot control reduces to visual prediction, and that robot capability is downstream of improvements in video modeling accuracy. The obvious implication was that advances in the video modeling frontier would directly translate to increased capabilities in end-to-end robot learning. FLUX-mimic is that thesis at frontier scale: We've applied our VAM architecture to the strongest video backbone available today, FLUX 3 from Black Forest Labs, and trained it on data from our own robots and wearables. General-purpose dexterity, running on a single GPU on premises. Because the model already understands world dynamics, it needs far fewer demonstrations to learn a new task. This is game-changing for our mission to deploy robots to factory floors, where industrial robot data is scarce and expensive to collect. We're now testing and deploying FLUX-mimic with manufacturing leaders like Audi USA, on complex, multi-step manipulation long considered impossible for conventional automation.

mimic

113,932 views • 6 days ago

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about @mimicrobotic. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton to learn more!

RoboPapers

48,343 views • 2 months ago

🚨 BREAKING: Microsoft's first robotics foundation model! 🤯 Microsoft just announced Rho-alpha (ρα), their first robotics model derived from the Phi series of vision-language models. Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks. Commands like "push the green button with the right gripper," "pull out the red wire," "flip the top switch on," or "turn the knob to position 5" get executed directly by dual-arm robots. What makes this different from standard vision-language-action (VLA) models is the additional modalities. Rho-alpha is a VLA+ model that adds tactile sensing to the perceptual mix, with plans to incorporate force feedback. On the learning side, the model is designed to continually improve during deployment by learning from human feedback. The training approach combines trajectories from physical demonstrations and simulated tasks with web-scale visual question answering data. Since teleoperation data is scarce and expensive, Microsoft is using NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets via reinforcement learning. These simulated trajectories get combined with commercial and open physical demonstration datasets. The model is currently under evaluation on dual-arm setups and humanoid robots. Microsoft is opening an Early Access Program for organizations interested in evaluating Rho-alpha. Robots that can adapt to dynamic situations and human preferences are more useful in real environments and more trusted by the people operating them. Read more here: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

60,912 views • 6 months ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago
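
The MotionGPT abstract above describes turning 3D motion into discrete "motion tokens" by vector quantization. The abstract gives no tokenizer details, so the following is only a minimal sketch of the nearest-codebook lookup at the core of vector quantization in general. The 2-D "pose features", the three-entry codebook, and the function names `quantize_motion` and `dequantize` are all hypothetical, chosen for illustration; a real system would learn the codebook from data.

```python
import math

def quantize_motion(frames, codebook):
    """Map each frame (a feature vector) to the index of its nearest codebook entry."""
    tokens = []
    for frame in frames:
        # Euclidean distance from this frame to every codebook vector.
        distances = [math.dist(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def dequantize(tokens, codebook):
    """Recover an approximate motion sequence by looking tokens up in the codebook."""
    return [codebook[t] for t in tokens]

# Hypothetical toy data: three codebook entries and four motion frames.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
frames = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8), (0.95, 0.05)]

tokens = quantize_motion(frames, codebook)
print(tokens)  # [0, 1, 2, 1]
```

Once motion is a sequence of integer ids like this, it can be fed to a language model alongside word tokens, which is the sense in which the abstract treats motion as "a specific language".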

Brain-controlled exoskeletons to train humanoid robots! 🧠 Fourier just presented human tele-operators using brain control interfaces and exoskeletal arms to train humanoid robots on home tasks. The brain control interface is the interesting part. Instead of using a controller or joystick to teleoperate, the operator's movements and intentions are captured more naturally through the exoskeleton and BCI. This means the demonstrations are more fluid, more human-like, and better suited for training robots to perform delicate home tasks. Multiple tele-operators are simultaneously generating training data across multiple robots. This is how you build the dataset needed for eventual full autonomy, without waiting years for it to arrive. This might be the bridge between "robots that work in controlled environments" and "robots that work in homes." Not full autonomy right away, but trusted human intelligence operating through a robot body, getting better with every task completed. ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

20,898 views • 5 months ago

Breaking news: Cosmos 3 is here. They are attempting to do something completely new 🤯

Why is Physical AI much harder than building a chatbot? Understanding the world is not enough, robots need to predict it and act inside it. That's the idea behind NVIDIA Cosmos 3:

→ Reasoning model understands what's happening from video, images, text, and actions.
→ World model generates future states of the environment.
→ Action model generates the actions needed to achieve a goal.

Previous systems often stitched these capabilities together using separate models. Cosmos 3 combines them into a single architecture with two components:

▪️ Reasoner Tower: Analyzes observations and builds an understanding of objects, motion, interactions, and physical context.
▪️ Generator Tower: Uses that understanding to generate future videos and action sequences that obey physical constraints.

So Cosmos 3 moves from:

Perception → Model A
Prediction → Model B
Actions → Model C

to:

Perception + Prediction + Actions → One unified system.

The goal is to make robots, autonomous vehicles, and smart environments better at answering three questions:

1. What is happening?
2. What will happen next?
3. What should I do?

That's a big shift from today's AI, which mostly focuses on generating text. And check the benchmarks! Physical AI needs to generate decisions that survive contact with the real world. 🚗🤖

Turing Post

13,755 views • 1 month ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight:

- Real humanoid teleoperation data.
- Large-scale simulation data: we are open-sourcing 300K+ trajectories!
- Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”!
- Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos.

GR00T N1 is a single end-to-end neural net, from photons to actions:

- Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions.
- Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2.

We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

466,261 views • 1 year ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

310,120 views • 1 year ago

Most video-action robot models are a content-creation video generator with an action module attached. LingBot-VA 2.0 from Robbyant, a video-action foundation model, throws that starting point out and trains the whole stack natively for control. And it runs closed-loop at a peak 225 Hz. This matters because a robot cannot move responsively when its controller pauses to imagine the next few frames. LingBot-VA 2.0 predicts during execution, then corrects using each real observation. And it carries only about 13B video parameters while activating roughly 1.9B per token. Bigger robot models usually mean slower reactions, creating a direct conflict between intelligence and control. LingBot-VA 2.0 is trained from scratch for robot control rather than adapted from a video generator built for content creation. Robbyant, an embodied AI company under Ant Group, built it to learn how scenes change under actions, predict what should happen next, and turn those predictions into real-time robot movements. Most video-action systems inherit a tokenizer and video backbone trained mainly to reproduce visual appearance. LingBot-VA 2.0 rebuilds both parts around physical control. Its semantic visual-action tokenizer maps observations toward features from a frozen vision foundation model and learns compact latent actions from frame-to-frame changes using self-supervised inverse and forward dynamics. Unlabeled web video can therefore carry action-relevant training signals without robot action labels. The policy is causal from the start, so every prediction can use only past observations. Its sparse Mixture-of-Experts video backbone has about 13B total parameters, while about 1.9B are active per token, keeping the compute lower during each step. A high-level vision-language planner breaks long tasks into smaller instructions, while the low-level video-action policy handles continuous movement. 
Foresight Reasoning predicts future visual states while the robot is already acting, then replaces imagined states with every new real observation. Combined with few-step distillation and systems acceleration, the paper reports a peak asynchronous execution frequency of 225 Hz. The model adapts from 10–15 demonstrations, transfers across robot embodiments, and handles some new tasks zero-shot. In the paper’s own evaluations, it reaches 93.6 average on RoboTwin 2.0 and reports stronger real-world results than LingBot-VA and π0.5 across the tested tasks. 🧵 1.

Rohan Paul

10,996 views • 16 days ago

I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5! DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of texts are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate the robots to collect motion data via XR headsets. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process. At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation. What DexMimicGen does is to trade GPU compute time for human time. It takes one motion trajectory from human, and multiplies into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world. Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (de facto infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day. The future of robot data is generative. The future of the entire robot learning pipeline will also be generative. 🧵

Jim Fan

165,246 views • 1 year ago

FRIEDBERG: VIDEO DATA WILL POWER THE NEXT GENERATION OF AI Friedberg broke down the scale shift coming to artificial intelligence, arguing that text based models like GPT are just the beginning, and that the real revolution will come from video-trained systems: “The internet and all these LLMs are language models trained on text from the internet, around 50 billion words total, maybe one to five terabytes of data in their training sets. But if you look at the video data out there, there are hundreds of billions of hours, much of it on YouTube. By some estimates, there’s a thousand exabytes of video data on the internet, about a billion times more than text data. I think we just saw that play out with the new video model that launched yesterday. Google has all this YouTube data, whether or not they’re using it to train, I don’t know. I’ve heard from insiders they’re not allowed to yet and would have to redo the terms of service.” Source: AIFinInsights david friedberg

Mario Nawfal

15,855 views • 7 months ago

New Gemini Robotics 1.5 models will enable robots to better reason, plan ahead, use digital tools like Search, and transfer learning from one kind of robot to another. Our next big step towards general-purpose robots that are truly helpful — you can see how the robot reasons as it sorts laundry in the video below.

Sundar Pichai

496,365 views • 10 months ago

Learning from robot data? Standard. Direct Video-Action Models (DVA) is different: treat robot control as video generation, then translate the generated video into actions. Built by , the system pre-trains causal video models from scratch and can run complex production tasks for hours using only ~10 hours of robot data. • hundreds of frames of visual context • real-time control via causal video prediction More: The team behind it just exited 18 months of stealth with a $450M Series A at a $1.7B valuation. Founded by Jagdeep Singh (ex-QuantumScape) with a Stanford-heavy science team: CSO Eric Ryan Chan (ex-WorldLabs) and Prof. Gordon Wetzstein. Already running in large-scale automotive production environments. Backed by Vinod Khosla Ventures, Temasek, Premji Invest, and John Doerr. Thanks for sharing, Tongzhou Mu 🤖🦾🦿 👋

Ilir Aliu

26,391 views • 4 months ago

A simple idea. Let robots collect the data that current foundation models are missing. A robot that gets better by doing real work in the real world. For two weeks in the Stanford East Asia Library, Scanford scanned shelves, helped librarians, and improved the vision language model it depends on. The idea is very simple: Robots do useful work. They gather the real-world data foundation models never see online. They fine-tune their own model. They go back out stronger. A full loop. What they found in deployment: ✅ 2103 shelves scanned with multilingual, faded, occluded book spines ✅ 18.7 hours of librarian time saved ✅ Book ID accuracy jumped from 32.0 percent to 71.8 percent ✅ English OCR improved from 24.8 percent to 46.6 percent ✅ Chinese OCR improved from 30.8 percent to 38.0 percent The most interesting part is the shift. Robots do not only consume foundation models. They create the data these models are missing. A clean robot-powered data flywheel. Work. Collect. Fine-tune. Repeat. Thanks for sharing, Jenn Grannen!

Ilir Aliu

44,660 views • 8 months ago

Pretraining is essential for good performance on a wide variety of robotics tasks, and so most vision-language-action models build off of a vision language model (VLM) trained on a wide variety of image-language data. But how does the choice of VLM translate to downstream robotics performance? Jianke Zhang and @GYanjiang join us to talk about this key part of the robot policy, looking at a wide variety of different VLMs and how they perform. Interestingly, they see that performance on auxiliary tasks like question answering did not lead to downstream improvements in control. To learn more, watch episode 65 of RoboPapers now, with Chris Paxton and Jiafei Duan!

RoboPapers

23,905 views • 4 months ago

Robotics has a data problem, in that robotics data is rare. While human video is quite common, it’s not usually directly usable for robots for a variety of reasons, most significantly that it’s missing explicit, accurate robot actions. Instead, Jeremy Collins proposes that we predict keypoint trajectories: basically, how any given point in an object will move as a robot performs a task. This lets us use action-free human video to train robot skills. Learn more by watching Episode #37 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton now.

RoboPapers

89,011 views • 9 months ago

Excited to announce Tau Robotics. We are building a general AI for robots. We start by building millions of robot arms that learn in the real world. In the video, two robot arms (one four-axis, one five-axis) are fully autonomous and controlled by a single neural network conditioned on different language instructions. The other two arms are teleoperated. The entire hardware cost in the video is about $1400. The video is at 1.5x speed.

Alexander Koch

437,856 views • 2 years ago

With enough data, robots and AI can learn “world models” that let them predict the results of their actions. These models are a way to learn how embodied AI agents can perform a wide variety of useful tasks, but they require a huge amount of data. The team at General Intuition has a solution: use data from video games! Games teach movement, problem solving, and complex spatial reasoning, and they come in a staggering diversity of forms, covering a wide variety of problems. What’s more, the captured data is high-quality, without the noise or annotation error that can come from … We sat down with Pim de Witte and Adam Jelley from the General Intuition team to learn more about their history, their plans, and their philosophy.

RoboPapers

85,927 views • 8 months ago