Robots might learn better from video than from language! 📼 Most Vision-Language-Action (VLA) models learn what to do from text, but still struggle with how things move in the real world. That makes them data-hungry and slow to train. mimic-video takes a different route. Instead of grounding robot control in text, it grounds it in video, using large pre-trained video models that already capture physical motion and dynamics. The idea is straightforward: let the video model handle “what will happen next,” and let a smaller control model focus only on turning that visual plan into robot actions. The result is big gains in practice. Robots trained this way need 10× less data, converge twice as fast, and perform better on both simulated benchmarks and real bimanual manipulation tasks. If robots can “imagine” motion using video, control becomes a much simpler problem. Shoutout to Jonas Pai, Liam Achenbach, Oier Mees, Elvis Nava and the rest of the team! Here's the project page: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

55,354 subscribers

49,920 views • 6 months ago • via X (Twitter)

Science & Technology

Related videos

Turning video into humanoid robot motion! 🤳🏼 Training humanoid robots needs huge amounts of motion data, but real-world capture doesn’t scale. Mocap is expensive, dangerous edge cases are rare, and you can’t ask humans to repeatedly fall or crash. Video2Robot tackles this by converting videos into physics-grounded humanoid simulations. Motion is generated to respect balance, inertia, ground contact, and joint limits, then directly retargeted to robot simulators. One prompt can generate a full humanoid motion sequence, including multi-agent interactions and failure cases like falls or collisions, scenarios that are hard or impossible to capture safely in the real world. The pipeline is model-agnostic and works with existing video generators, making it a practical way to scale data for robots. If robots are going to operate in the real world, they need to be trained on the failures too, not just the perfect demos. Here's the GitHub: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

59,693 views • 6 months ago

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about Mimic Robotic. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton to learn more!

RoboPapers

46,190 views • 1 month ago

🚨 BREAKING: Microsoft's first robotics foundation model! 🤯 Microsoft just announced Rho-alpha (ρα), their first robotics model derived from the Phi series of vision-language models. Rho-alpha translates natural language commands into control signals for robotic systems performing bimanual manipulation tasks. Commands like "push the green button with the right gripper," "pull out the red wire," "flip the top switch on," or "turn the knob to position 5" get executed directly by dual-arm robots. What makes this different from standard vision-language-action (VLA) models is the additional modalities. Rho-alpha is a VLA+ model that adds tactile sensing to the perceptual mix, with plans to incorporate force feedback. On the learning side, the model is designed to continually improve during deployment by learning from human feedback. The training approach combines trajectories from physical demonstrations and simulated tasks with web-scale visual question answering data. Since teleoperation data is scarce and expensive, Microsoft is using NVIDIA Isaac Sim on Azure to generate physically accurate synthetic datasets via reinforcement learning. These simulated trajectories get combined with commercial and open physical demonstration datasets. The model is currently under evaluation on dual-arm setups and humanoid robots. Microsoft is opening an Early Access Program for organizations interested in evaluating Rho-alpha. Robots that can adapt to dynamic situations and human preferences are more useful in real environments and more trusted by the people operating them. Read more here: ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

60,847 views • 5 months ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago
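
The MotionGPT abstract above describes turning continuous 3D motion into discrete "motion tokens" through vector quantization. The lookup step behind that idea can be sketched as follows. This is only an illustration under stated assumptions: the codebook values, the 4-dimensional frame features, and the function names are hypothetical, and the paper's actual model learns its codebook rather than using a fixed one.

```python
from typing import List, Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize_motion(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Replace each motion frame by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(frame, codebook[k]))
        for frame in frames
    ]


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[Vector]:
    """Map token indices back to their codebook vectors (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Hypothetical toy codebook: entry k is the vector [k, k, k, k].
codebook = [[float(k)] * 4 for k in range(8)]

# Hypothetical "motion frames": slightly perturbed codebook entries.
frames = [[3.1] * 4, [0.9] * 4, [1.2] * 4, [5.8] * 4]

tokens = quantize_motion(frames, codebook)
print(tokens)  # [3, 1, 1, 6]
```

Once motion is a sequence of integer tokens like this, it can in principle be fed to a language model alongside word tokens, which is the "motion vocabulary" idea the abstract refers to.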

Brain-controlled exoskeletons to train humanoid robots! 🧠 Fourier just presented human tele-operators using brain control interfaces and exoskeletal arms to train humanoid robots on home tasks. The brain control interface is the interesting part. Instead of using a controller or joystick to teleoperate, the operator's movements and intentions are captured more naturally through the exoskeleton and BCI. This means the demonstrations are more fluid, more human-like, and better suited for training robots to perform delicate home tasks. Multiple tele-operators are simultaneously generating training data across multiple robots. This is how you build the dataset needed for eventual full autonomy, without waiting years for it to arrive. This might be the bridge between "robots that work in controlled environments" and "robots that work in homes." Not full autonomy right away, but trusted human intelligence operating through a robot body, getting better with every task completed. ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

20,874 views • 4 months ago

Excited to announce GR00T N1, the world’s first open foundation model for humanoid robots! We are on a mission to democratize Physical AI. The power of general robot brain, in the palm of your hand - with only 2B parameters, N1 learns from the most diverse physical action dataset ever compiled and punches above its weight: - Real humanoid teleoperation data. - Large-scale simulation data: we are open-sourcing 300K+ trajectories! - Neural trajectories: we apply SOTA video generation models to “hallucinate” new synthetic data that features accurate physics in pixels. Using Jensen’s words, “systematically infinite data”! - Latent actions: we develop novel algorithms to extract action tokens from in-the-wild human videos and neural generated videos. GR00T N1 is a single end-to-end neural net, from photons to actions: - Vision-Language Model (System 2) that interprets the physical world through vision and language instructions, enabling robots to reason about their environment and instructions, and plan the right actions. - Diffusion Transformer (System 1) that “renders” smooth and precise motor actions at 120 Hz, executing the latent plan made by System 2. We deploy N1 on GR1 robot, 1X Neo robot, and a large collection of simulation benchmarks. N1 achieves up to +30% boost in diverse manipulation tasks for household and industrial settings. While humanoid robots are the main focus of N1, our model also supports cross-embodiment. We finetune it to work on the $110 HuggingFace LeRobot SO100 robot arm! Open robot brain runs on open hardware. Sounds just right. Let’s solve robotics, together, one token at a time. Links to our Whitepaper, Github repo, HuggingFace model, and open dataset page in the thread: 🧵

Jim Fan

465,820 views • 1 year ago

TWIST: a real-time teleoperation system for humanoid robots to mimic whole-body motions. Reference motion data is generated by retargeting human motion-capture data to the robot. Then, the controller is trained in simulation using reinforcement learning and behavior cloning.

The Humanoid Hub

46,715 views • 1 year ago

Vision-language models can control robots, but what if the prompt is too complex for the robot to follow directly? We developed a way to get robots to “think through” complex instructions, feedback, and interjections. We call it the Hierarchical Interactive Robot (Hi Robot).

Physical Intelligence

116,845 views • 1 year ago

Our vision is for AI that uses world models to adapt in new and dynamic environments and efficiently learn new skills. We’re sharing V-JEPA 2, a new world model with state-of-the-art performance in visual understanding and prediction. V-JEPA 2 is a 1.2 billion-parameter model, trained on video, that can enable zero-shot planning in robots—allowing them to plan and execute tasks in unfamiliar environments. Learn more about V-JEPA 2 ➡️ As we continue working toward our goal of achieving advanced machine intelligence (AMI), we’re also releasing three new benchmarks for evaluating how well existing models can reason about the physical world from video. Learn more and download the new benchmarks ➡️

AI at Meta

309,704 views • 1 year ago

A toilet-cleaning robot from Reflex Robotics. The best part is when the robot dries its hands... That might be the future of robots doing chores for us. ~~ ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

15,766 views • 3 months ago

FRIEDBERG: VIDEO DATA WILL POWER THE NEXT GENERATION OF AI Friedberg broke down the scale shift coming to artificial intelligence, arguing that text based models like GPT are just the beginning, and that the real revolution will come from video-trained systems: “The internet and all these LLMs are language models trained on text from the internet, around 50 billion words total, maybe one to five terabytes of data in their training sets. But if you look at the video data out there, there are hundreds of billions of hours, much of it on YouTube. By some estimates, there’s a thousand exabytes of video data on the internet, about a billion times more than text data. I think we just saw that play out with the new video model that launched yesterday. Google has all this YouTube data, whether or not they’re using it to train, I don’t know. I’ve heard from insiders they’re not allowed to yet and would have to redo the terms of service.” Source: AIFinInsights david friedberg

Mario Nawfal

15,855 views • 6 months ago
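
The figures quoted in the Friedberg post can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal SI units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes) and takes the quoted estimates at face value; it does not verify the estimates themselves, and the variable names are illustrative.

```python
TERABYTE = 10 ** 12  # bytes, decimal SI (assumption)
EXABYTE = 10 ** 18   # bytes, decimal SI (assumption)

# Quoted estimate: "a thousand exabytes of video data on the internet".
video_bytes = 1000 * EXABYTE


def video_to_text_ratio(text_terabytes: int) -> float:
    """How many times larger the quoted video estimate is than a text corpus."""
    return video_bytes / (text_terabytes * TERABYTE)


# Quoted estimate for text training sets: "maybe one to five terabytes".
ratios = {tb: video_to_text_ratio(tb) for tb in (1, 5)}

for tb, ratio in ratios.items():
    print(f"{tb} TB of text -> video is {ratio:.0e} times larger")
```

Under these assumptions the ratio is 10^9 for a 1 TB text corpus and 2 × 10^8 for a 5 TB corpus, so the "about a billion times more" claim matches the low end of the quoted text range.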

I don’t know if we live in a Matrix, but I know for sure that robots will spend most of their lives in simulation. Let machines train machines. I’m excited to introduce DexMimicGen, a massive-scale synthetic data generator that enables a humanoid robot to learn complex skills from only a handful of human demonstrations. Yes, as few as 5! DexMimicGen addresses the biggest pain point in robotics: where do we get data? Unlike with LLMs, where vast amounts of texts are readily available, you cannot simply download motor control signals from the internet. So researchers teleoperate the robots to collect motion data via XR headsets. They have to repeat the same skill over and over and over again, because neural nets are data hungry. This is a very slow and uncomfortable process. At NVIDIA, we believe the majority of high-quality tokens for robot foundation models will come from simulation. What DexMimicGen does is to trade GPU compute time for human time. It takes one motion trajectory from human, and multiplies into 1000s of new trajectories. A robot brain trained on this augmented dataset will generalize far better in the real world. Think of DexMimicGen as a learning signal amplifier. It maps a small dataset to a large (de facto infinite) dataset, using physics simulation in the loop. In this way, we free humans from babysitting the bots all day. The future of robot data is generative. The future of the entire robot learning pipeline will also be generative. 🧵

Jim Fan

165,246 views • 1 year ago

Qwen-VLA feels like one of the first real robotics foundation models. A single system trained across robot manipulation, navigation, egocentric human video, simulation, and vision-language reasoning instead of isolated robot policies.

Robots Digest 🤖

14,601 views • 29 days ago

Learning from robot data? Standard. Direct Video-Action Models (DVA) is different: treat robot control as video generation, then translate the generated video into actions. Built by , the system pre-trains causal video models from scratch and can run complex production tasks for hours using only ~10 hours of robot data. • hundreds of frames of visual context • real-time control via causal video prediction More: The team behind it just exited 18 months of stealth with a $450M Series A at a $1.7B valuation. Founded by Jagdeep Singh (ex-QuantumScape) with a Stanford-heavy science team: CSO Eric Ryan Chan (ex-WorldLabs) and Prof. Gordon Wetzstein. Already running in large-scale automotive production environments. Backed by Vinod Khosla Ventures, Temasek, Premji Invest, and John Doerr. Thanks for sharing, Tongzhou Mu 🤖🦾🦿 👋

Ilir Aliu

26,209 views • 3 months ago

New Gemini Robotics 1.5 models will enable robots to better reason, plan ahead, use digital tools like Search, and transfer learning from one kind of robot to another. Our next big step towards general-purpose robots that are truly helpful — you can see how the robot reasons as it sorts laundry in the video below.

Sundar Pichai

496,147 views • 9 months ago

🤖What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,442 views • 8 months ago

State-of-the-art robot policies often need hundreds of hours of data. What if we needed none? Introducing TiPToP: a manipulation system that zero-shots open-world tasks from pixels and language using vision foundation models and GPU-parallelized Task and Motion Planning (TAMP).

Nishanth Kumar

77,488 views • 3 months ago

Large language models reason through text. Vision‑language‑action models reason through the real world. By fusing perception, context, and action from live video, VLAs deliver the awareness physical AI needs for next‑gen robotics and edge systems.

Intel

15,931 views • 4 months ago

A simple idea. Let robots collect the data that current foundation models are missing. A robot that gets better by doing real work in the real world. For two weeks in the Stanford East Asia Library, Scanford scanned shelves, helped librarians, and improved the vision language model it depends on. The idea is very simple: Robots do useful work. They gather the real world data foundation models never see online. They fine tune their own model. They go back out stronger. A full loop. What they found in deployment: ✅ 2103 shelves scanned with multilingual, faded, occluded book spines ✅ 18.7 hours of librarian time saved ✅ Book ID accuracy jumped from 32.0 percent to 71.8 percent ✅ English OCR improved from 24.8 percent to 46.6 percent ✅ Chinese OCR improved from 30.8 percent to 38.0 percent. The most interesting part is the shift. Robots do not only consume foundation models. They create the data these models are missing. A clean robot powered data flywheel. Work. Collect. Fine tune. Repeat. Thanks for sharing, Jenn Grannen! If you want the full write up: 📍Website: Paper: —- Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

44,660 views • 7 months ago