🤖 One predictive backbone, three distinct tasks, consistent gains: a strong signal that investing in reusable world models is the right abstraction for robot learning. Toyota Research Institute (TRI) just ran the same idea three different ways, and it worked each time. Using our NVIDIA Cosmos Predict 2–style world models,...

NVIDIA AI Developer

110,898 subscribers

38,100 views • 5 months ago • via X (Twitter)

Similar Videos

AI Is Moving Beyond “Generating Videos”: Toward “Generating Worlds”

Over the past two years, AI video models have advanced at an astonishing pace. From Runway and Pika to Sora and Veo, AI-generated videos have become increasingly realistic and more consistent with the physical laws of the real world. Many people believe the next objective is simply to generate videos that are longer, sharper, and more lifelike. But if we take a step back, we can see that the real transformation is not happening in video itself. It is happening in world models.

What Is a World Model?

In 1943, psychologist Kenneth Craik proposed an idea that would influence artificial intelligence research for decades. He argued that the human brain does not merely react to the outside world. Instead, it maintains an internal model of how the world works. Because we have this internal model, we can predict the outcome of an action before we actually take it. Before crossing a road, we estimate whether a car will pass by. Before catching a ball, we predict its trajectory. These abilities come from continuously simulating the world in our minds, rather than relying entirely on trial and error.

This idea later became known by a more formal term: World Model. A world model does not describe a single image or a fixed video clip. It is an internal representation capable of continuously simulating the rules and dynamics of the real world.

Why Is AI Research Turning Toward World Models?

Because predicting “what comes next” is becoming increasingly central to how AI systems work. Language models predict the next token. Image models predict the next step in the denoising process. Video models predict the next frame. A world model, however, attempts to predict something broader: What should the world look like in the next moment?

In 2018, David Ha and Jürgen Schmidhuber proposed in their paper World Models that an intelligent agent could first learn a model of the world, and then use that internal model to plan its actions. The Dreamer series later demonstrated that many complex tasks could be learned by training agents inside an “imagined world.” At the same time, the development of video models such as Sora and Veo led researchers to another realization: A model capable of continuously generating video has already learned, at least implicitly, many of the rules governing the real world. As a result, these two research directions have gradually begun to converge.

But Video Is Not Yet a World

This is where the distinction is often misunderstood. For a world model to support meaningful real-time interaction, it must solve several critical problems. Most video models today are essentially answering one question: What should the next frame look like? A true world model needs to answer much more: What happens if I take one step forward? If I walk behind a building and then return, will the building still be there? If I suddenly change the camera angle, will the entire space remain consistent? If I enter a command such as “Summon a dragon,” will the world respond immediately?

In other words, a world model must do more than generate content. It must understand space. It must understand time. It must understand causality. And it must understand interaction. Moving from watching to participating is where the real difficulty of world models begins.

World Models Are Entering the Interactive Era

One of the latest attempts in this direction is Alaya World, recently open-sourced by Alaya World, or Alaya Lab. Instead of generating a fixed video clip, it generates a world that users can explore in real time. Users can begin with text, an image, or a video, enter the generated scene, move freely through it, and introduce new prompts at any moment during generation. The world responds immediately.

According to the publicly released information, Alaya World provides:
- Real-time streaming generation at 720p and 24 FPS
- Stable continuous exploration for more than one minute
- The ability to switch prompts and trigger skills or events during generation
- Model weights and inference code released under the Apache 2.0 License
- Training code and datasets planned for future release

What makes these capabilities important is not simply the technical specifications. It is that the generated “world” can now support continuous interaction. The official demo shows that users can genuinely control, transform, and explore the generated environment.

AI Is Evolving From a Tool Into an Environment

Over the past few years, most discussions around AI have focused on content generation. Generating text. Generating images. Generating videos. But world models raise a fundamentally different question: Can AI generate an environment that people can inhabit, explore, and continuously evolve? If the answer is yes, the impact will extend far beyond video generation. Game development, robotics training, embodied intelligence, digital twins, virtual production, and many other fields could be transformed by the development of world models.

World models are still at a very early stage. Yet from Craik’s proposal of an internal mental model more than eighty years ago to the emergence of today’s interactive world-generation systems, a clear evolutionary path is beginning to take shape. Perhaps what AI is ultimately learning has never been limited to images, videos, or language. Perhaps it is learning the world itself.

References
GitHub:
Technical Report:

雪踏乌云

112,114 views • 15 days ago

OpenAI's Deep Research is getting a run for its money. Deep Lake was just released, and it's a different take on an AI system that can do deep research on your own data. You can use Deep Lake to build AI search with reasoning on your private and public data. (Look at the attached videos to get an idea of how it works.) If you want to research proprietary and sensitive data, Deep Research won't help you because it's limited to public data. Deep Lake, however, will allow you to use your private data. On top of that, Deep Lake supports multi-modal retrieval from the ground up. It uses vision language models for data ingestion and retrieval so that you can connect any data (PDFs, images, videos, structured data, etc.). You can even use mixed-data queries! Deep Lake can search your data from S3, Dropbox, and GCP. It learns from your queries over time, making the results as relevant to your work as possible!

Santiago

171,340 views • 1 year ago

In a masterclass at Sequoia Capital AI Ascent, Jim Fan laid out the "Great Parallel": how robotics is speedrunning the LLM playbook. 🔹 VLA → WAM: Moving from language-heavy models to "World Action Models" that dream in physics. 🔹 Teleop → EgoScale: Replacing manual data with human egocentric video. 🔹 Simulation 2.0: Using neural simulators like DreamDojo to turn compute into environments. "Our generation was born too late to explore the earth and too early to explore the stars. But we are born just in time to solve robotics." He believes that robots will pass the Physical Turing Test in the coming 2–3 years.

Humanoids daily

12,123 views • 2 months ago

This is some quietly impressive work on making video world models actually controllable in 4D space. VerseCrafter lets you take an input image, use something like Blender to animate the 3D camera path and object trajectories, then uses that to condition generation. Scribbling in 2D feels so crude in comparison. The authors represent everything in a shared 4D world state - static background as a point cloud, moving objects as 3D gaussian trajectories. The gaussians are an interesting choice because they capture position, shape, and orientation probabilistically rather than forcing rigid bounding boxes or category specific models like SMPL-X for human bodies. They bolt this onto frozen Wan2.1 with a lightweight adapter, so they get a strong video prior. They also built a pipeline to auto extract 4D annotations from real world videos to train this puppy. It doesn't look sexy yet, but IMO this is the interface video world models need - actual 3D authoring tools to exert control rather than crude scribbles and prompt incantations.

Bilawal Sidhu

25,802 views • 6 months ago

📢 Our lab has been exploring 3D world models for years — and we’re thrilled to share **PhysTwin**: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang 👉 PhysTwin combines **Gaussian splatting** with **inverse dynamics optimization** based on simple **spring-mass** systems. ⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models). 🔑 A few key takeaways: 1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability. 2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion). 3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨ 4. The resulting digital twin supports a wide range of downstream applications—especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities. 🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.

Yunzhu Li

25,279 views • 1 year ago
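The PhysTwin post above pairs spring-mass dynamics with inverse dynamics optimization. As a rough illustration of that general idea, and not the authors' implementation, the sketch below advances a 1D spring-mass chain with semi-implicit Euler steps and then recovers a spring stiffness from an observed trajectory by a simple grid search. Every name here (`Spring`, `step`, `rollout`, `fit_stiffness`) and every constant is hypothetical, and the grid search stands in for whatever optimizer a real system would use.

```python
from dataclasses import dataclass


@dataclass
class Spring:
    i: int        # index of the left particle
    j: int        # index of the right particle (assumed to sit at a larger position)
    rest: float   # rest length
    k: float      # stiffness


def step(pos, vel, springs, mass=1.0, dt=0.01):
    """Advance 1D particles by one semi-implicit Euler step."""
    forces = [0.0] * len(pos)
    for s in springs:
        stretch = (pos[s.j] - pos[s.i]) - s.rest
        f = s.k * stretch          # Hooke's law: a stretched spring pulls i toward j
        forces[s.i] += f
        forces[s.j] -= f
    new_vel = [v + dt * f / mass for v, f in zip(vel, forces)]
    new_pos = [p + dt * v for p, v in zip(pos, new_vel)]  # uses the updated velocity
    return new_pos, new_vel


def rollout(k, steps=50, dt=0.01):
    """Simulate two particles joined by one spring; return particle 1's positions."""
    pos, vel = [0.0, 1.5], [0.0, 0.0]
    springs = [Spring(0, 1, rest=1.0, k=k)]
    trajectory = []
    for _ in range(steps):
        pos, vel = step(pos, vel, springs, dt=dt)
        trajectory.append(pos[1])
    return trajectory


def fit_stiffness(observed, candidates):
    """Pick the candidate stiffness whose rollout best matches the observation."""
    def squared_error(k):
        simulated = rollout(k, steps=len(observed))
        return sum((a - b) ** 2 for a, b in zip(simulated, observed))
    return min(candidates, key=squared_error)


observed = rollout(10.0)  # stand-in for a tracked real-world trajectory
print(fit_stiffness(observed, [5.0, 10.0, 20.0]))
```

The forward simulator is kept separate from the fitting loop on purpose: once the parameters are fitted, the same `step` function can be driven by new actions, which is the sense in which a fitted physical model can act as a world model.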

Predicting the next word "only" is sufficient for language models to learn a large body of knowledge that enables them to code, answer questions, understand many topics, chat, and so on. This is clear to many researchers now, and there are nice tutorials on why this works by Ilya Sutskever, who appeals to compression, and by Geoffrey Hinton. However, the emergence of types of understanding is not unique to language models. In work by Misha Denil and Brandon Amos, the authors trained models to predict the next few time steps of over a hundred robot hand sensors (Touch, Gyro, Accelerometer, Joint Info, Actuator Info, etc.). They then found out that they could regress the shape of the thing the hand was touching from the activations of the neural networks using probes. That is, the model developed an internal representation of shapes even though it was simply used to predict "only" the next few senses. Awareness follows from simple predictions and interaction with the world.

Nando de Freitas

134,252 views • 2 years ago
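The probing step mentioned in the post above, regressing a property from a network's activations, can be sketched with a linear least-squares probe. This is a minimal illustration on synthetic data, not the setup from the work the post refers to: the "activations" are random vectors constructed to encode a scalar "shape" label linearly, and all names and sizes are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: one scalar "shape" label per sample, and 16-unit
# "activations" that encode it linearly with additive noise (by construction).
n_samples, n_units = 400, 16
shape_label = rng.normal(size=n_samples)
mixing = rng.normal(size=n_units)
activations = np.outer(shape_label, mixing) + 0.1 * rng.normal(size=(n_samples, n_units))


def with_bias(acts):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([acts, np.ones((acts.shape[0], 1))])


def fit_linear_probe(acts, target):
    """Least-squares weights mapping activations (plus bias) to the target."""
    weights, *_ = np.linalg.lstsq(with_bias(acts), target, rcond=None)
    return weights


def probe_r2(weights, acts, target):
    """Coefficient of determination of the probe on the given data."""
    residual = target - with_bias(acts) @ weights
    return 1.0 - np.sum(residual ** 2) / np.sum((target - target.mean()) ** 2)


# Fit on one split, evaluate on held-out samples.
train, held_out = slice(0, 300), slice(300, None)
weights = fit_linear_probe(activations[train], shape_label[train])
r2 = probe_r2(weights, activations[held_out], shape_label[held_out])
print(f"held-out R^2 of the probe: {r2:.3f}")
```

A high held-out score only shows that the property is linearly decodable from the activations. In a real study the activations would come from a trained network that never saw the label, and the held-out split is what guards against the probe memorizing the training samples.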

Today may be the ImageNet moment for robotics. RT-X: the largest open-source robot dataset ever compiled, across 33 institutes, 22 robot hardware, 527 skills, and 1M episodes. Why is robotics lagging so far behind NLP, vision, and other AI domains? Data scarcity is the main culprit to blame, among other difficulties. Unlike text, images, and videos, you cannot download mass amounts of onboard robot control data from the internet. They simply don't exist in the wild. 11 yrs ago, ImageNet kicked off the deep learning revolution. 3-4 yrs ago, internet-scale data fueled the first GPTs and Diffusions that define this era of foundation models. I think 2023 is finally the year for robotics to scale up. Robot foundation models like VIMA (my team's work at NVIDIA) and RT-1/2 (Google DeepMind's effort) are extremely data hungry. While massively parallel simulations like NVIDIA IsaacGym & Omniverse can alleviate the problem to some extent, it's still not quite enough to bridge the gap to the messy, physical world. This new dataset is not just a technical contribution. I also see it as a commendable effort to overcome institutional bureaucracies and unite researchers from around the world to tackle a grand challenge together. Robotics will be the final holy grail that we capture in AI. We are not there yet, but ascending in the right gradient direction. RT-X website: Launch blog:

Jim Fan

265,038 views • 2 years ago

Introducing Kaleido💮 from AI at Meta: a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea, we formulate rendering purely as a sequence-to-sequence generation problem, successfully unifying neural rendering with the architecture principles behind modern language and video models. Unlike traditional neural rendering methods, Kaleido learns 3D purely in a data-driven way, without explicit 3D representations or structures. It acquires spatial understanding directly through large-scale video pretraining, then multi-view 3D data finetuning, inspired by how LLMs acquire textual common sense from large corpora before specialising in domains like coding. Through extensive ablations, we progressively modernised the architecture design and training strategies and tackled key scaling challenges in sequence-to-sequence generative rendering, arriving at a design that’s simple, versatile, and scalable. Kaleido significantly outperforms prior generative models in few-view settings, and remarkably is the first zero-shot generative method that matches InstantNGP-level rendering quality in multi-view settings. We also view Kaleido as an alternative step towards world modeling that flexibly spans a spectrum of “realities”: with many views, it faithfully reconstructs grounded reality; with fewer views, it imagines plausible unseen details. 🔗 Explore more results and paper:

Shikun Liu

22,367 views • 9 months ago

Robora Sim: A PyBullet-Powered Environment for Learning Robotic Physical Intelligence We are currently building our Robora simulation environment for sim-based learning, leveraging PyBullet, an industry-standard physics engine widely used in AI-driven robotics research and development. The environment is optimized with GPU-accelerated learning algorithms, enabling high-speed imitation learning and reinforcement learning within a safe and controlled virtual setup before shipping out to the real world. This simulation platform allows our models to learn, adapt, and generalize across different robot morphologies, terrain types, and task objectives, all before deployment to the real world. At its core, the system combines a VLA-powered high-level planner with low-level motion control algorithms, working cohesively to produce emergent, physically intelligent behaviors. This synergy between simulation, learning, and real-world transfer marks a major step forward in our pursuit of adaptive and intelligent robotic systems. Through advanced domain randomization and synthetic data generation, the Robora Simulation Environment ensures that policies trained in simulation transfer effectively to real-world robots, minimizing the sim-to-real gap. Moreover, users will be able to test and integrate their own hardware kits within selected simulation environments in the Robora Dapp, ensuring seamless compatibility and safer real-world implementation.

Robora

23,489 views • 9 months ago
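The Robora post above credits domain randomization with narrowing the sim-to-real gap but gives no details. In its most common form, each training episode draws fresh physics parameters from fixed ranges so a policy cannot overfit to one simulator configuration. The sketch below shows only that sampling step in plain Python. The parameter names and ranges are invented for illustration and are not Robora's. In a PyBullet setup the sampled values would typically be applied through calls such as `changeDynamics` and `setGravity`, which this example deliberately leaves out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PhysicsParams:
    friction: float    # lateral friction coefficient of the ground contact
    mass_scale: float  # multiplier applied to nominal link masses
    gravity_z: float   # vertical gravity in m/s^2


# Hypothetical ranges; real ranges would be tuned to the robot and task.
RANGES = {
    "friction": (0.4, 1.2),
    "mass_scale": (0.8, 1.2),
    "gravity_z": (-10.2, -9.4),
}


def sample_params(rng):
    """Draw one independent uniform sample per parameter."""
    return PhysicsParams(
        **{name: rng.uniform(lo, hi) for name, (lo, hi) in RANGES.items()}
    )


rng = random.Random(0)  # seeded so a training run can be reproduced
episodes = [sample_params(rng) for _ in range(5)]
for params in episodes:
    print(params)
```

Passing an explicit `random.Random` instance keeps the randomization reproducible and independent of any other code that touches the global random state.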

A Letter to Our Community: The Road Ahead for Robotics To our Community and Partners, As we step into 2026, our mission at Axis is clearer than ever: Constructing the definitive End-to-End Scaling Layer for Robotics. Our goal is to accelerate the transfer of diverse human intelligence into Robotics General Intelligence (RGI). By owning the critical path of intelligence creation, we are turning the physical limitations of robotics into a scalable, software-driven future. Here is our strategic outlook and roadmap for the year ahead. The Core Thesis: Simulation is the Only Way Out The path to RGI is currently blocked by Data Scarcity, Generalization Fragility, and Hardware Fragmentation. At Axis, we believe Simulation is the only way out. Our Simulation Data Platform and Data Augmentation Engine transform raw data into "Synthetic Gold". Backed by academic milestones like Roboverse, Skill Blending, and GraspVLA, we have proven that pure simulation can achieve the generalization required for the real world. We don’t just collect data; we architect it. The Engine: Why Crypto? We believe RGI should come from all, not a few. Crypto is not just a feature; it is the primitive that powers our entire ecosystem flywheel: - Incentive Mechanism: Democratizing contribution and rewarding the trainers and developers. - Assetization: Turning proprietary data and refined models into liquid, ownable assets. - Verifiable Workflow: We are opening the "Black Box" of AI. By bringing total transparency to the Task Generation → Data Collection → Model Training pipeline, we ensure every byte of intelligence is verifiable, traceable, and secure. 2026 Strategic Deliverables This year, we are committed to delivering three foundational pillars: - The World's Largest Training Dataset for Robots: A robot training set—diverse, high-quality interaction data at an unprecedented scale. 
- A Robotics Foundation Model: A universal robotic brain trained on our pure simulation and synthetic data, capable of robust cross-embodiment transfer and open-world adaptability. - Evolvable Robot Hardware: Robots deployed with Axis models that autonomously evolve through continuous interaction, turning every deployment into a self-improving node within our RGI network. The Ultimate Vision We are building more than models; we are architecting the Distributed Machine Economy. A future where every dataset, model, and robotic embodiment is a verifiable asset in a global, autonomous network. Thank you for building the future of intelligence with us✌️

Axis Robotics

27,858 views • 6 months ago

Imagine having a ping pong robot! 🏓 Researchers and developers building physical AI: meet Reachy 2 from Pollen Robotics, an open-source humanoid robot for real-world experimentation. It’s a bimanual mobile manipulator: each 7-DOF arm mimics human proportions and can lift up to 3 kg, giving it the dexterity for object handling. It can be controlled with Python and ROS2 Humble, or you can go straight into VR teleoperation: use a headset to move Reachy’s arms, hands, and head, and see through its cameras as if you’re in the robot’s own body. Want it to move around? A mobile base with three omnidirectional wheels, rich sensors, and LiDAR lets Reachy 2 navigate and explore its surroundings smoothly. 🗺️ Under the hood, it’s powered by a CPU system that’s ready for machine learning, perfect for loading AI frameworks and testing new models from Hugging Face directly on the robot. Keep making robots more and more accessible, Pollen team! ... and keep making more open source models to make robots more mainstream, clem 🤗!

Lukas Ziegler

37,221 views • 11 months ago

World modeling and imitation learning have largely been considered two disparate worlds. In our recent work, Unified World Models, just accepted to #RSS2025, Chuning Zhu provides a dead-simple unifying solution: just train a joint diffusion model over actions and future states, but with *decoupled* diffusion time steps across these modalities. Manipulating these decoupled time steps then allows for marginalization or conditioning on actions or states; a single model can serve as a policy, forward dynamics model, video prediction model, or inverse dynamics model simply by setting the diffusion timesteps carefully. The resulting model can leverage video datasets along with robot training data much more effectively, and shows improved robustness, generalization, and flexibility. This is exciting because it is frustratingly simple, scalable, and shows strong improvement on real-world robotics problems. Please refer to Chuning Zhu's excellent thread for more details! More details/code can be found on our website and in the paper -

Abhishek Gupta

11,430 views • 1 year ago
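
The Unified World Models post above says one joint diffusion model can act as a policy, forward dynamics model, video prediction model, or inverse dynamics model depending on how the per-modality timesteps are set. The sketch below only illustrates that bookkeeping. It assumes the usual convention that timestep `T` means pure noise (the modality carries no information, so it is marginalized out) and timestep `0` means a clean, given input (the modality is conditioned on). The value of `T`, the mode names, and the helper functions are hypothetical and are not taken from the paper's code.

```python
T = 1000  # hypothetical number of diffusion steps

SAMPLE = "sample"            # denoise this modality from T down toward 0
MARGINALIZE = "marginalize"  # hold at T: fully noised, carries no information
CONDITION = "condition"      # hold at 0: clean and given as input

# (role of actions, role of future observations) for each use of the model
MODES = {
    "policy": (SAMPLE, MARGINALIZE),
    "video_prediction": (MARGINALIZE, SAMPLE),
    "forward_dynamics": (CONDITION, SAMPLE),
    "inverse_dynamics": (SAMPLE, CONDITION),
}


def timestep_for(role: str, t: int) -> int:
    """Map a modality's role to the timestep fed to the denoiser."""
    if role == SAMPLE:
        return t
    if role == MARGINALIZE:
        return T
    return 0  # CONDITION


def timestep_schedule(mode: str, num_steps: int = 4) -> list:
    """Return (t_action, t_future_obs) pairs for each sampling step."""
    action_role, obs_role = MODES[mode]
    schedule = []
    for i in range(num_steps):
        t = T - (i * T) // num_steps  # 1000, 750, 500, 250 for 4 steps
        schedule.append((timestep_for(action_role, t), timestep_for(obs_role, t)))
    return schedule
```

For example, `timestep_schedule("policy")` keeps the future-observation timestep pinned at `T` while the action timestep counts down, which is what lets the same network ignore future frames and behave as a policy.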

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

The architecture of this new world model is one of the most interesting things I've seen lately: Let me first explain how most world models work: They predict and render one frame at a time. If you are navigating in one of these worlds, and you look left, the model draws whatever looks right in the moment. Every time you change your viewpoint, the model has to imagine what should be there again, so it's very common for these models to "forget" what's in the world. For example, if you put a toy on the table, look away, then look back, the toy might not be there anymore. Tripo AI is releasing its Project Eden model, which works very differently: The model builds the world first, and then renders it based on that map. That map holds the real state of the world: the geometry, every object, where things are, what's already happened. The picture you see on screen gets generated from the map. This architecture flips the whole thing. Now, you get the following: 1. The world stops forgetting. Leave, come back, and the toy is still on the table because it lives in the map, not in the last frame you saw. 2. You can edit the world, and those changes persist for anyone who enters later. 3. Multiple people and AI agents can coexist in the world and see it from different perspectives. This is early research, but it's looking really promising. They just raised nearly $200M across two rounds to build it out. Tripo will be at SIGGRAPH 2026 (July 19–23, Los Angeles Convention Center). If you work in 3D, embodied AI, simulation, or anything spatial, go connect with them there.

Santiago

30,189 views • 1 month ago
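
The post above contrasts frame-by-frame generation with a model that keeps a persistent map and renders from it. The toy sketch below illustrates only that separation of state from rendering. The `WorldMap` class and its one-dimensional "view" are hypothetical simplifications and say nothing about how Project Eden is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class WorldMap:
    """Persistent state: object name -> (x, y) position."""
    objects: dict = field(default_factory=dict)

    def place(self, name: str, position: tuple) -> None:
        """Edit the world; the change lives in the map, not in any frame."""
        self.objects[name] = position

    def render(self, view_min_x: float, view_max_x: float) -> list:
        """Derive what is visible from the map for the requested view."""
        return sorted(
            name
            for name, (x, _y) in self.objects.items()
            if view_min_x <= x <= view_max_x
        )


world = WorldMap()
world.place("toy", (2.0, 0.0))

first_look = world.render(0.0, 5.0)     # ['toy']
look_away = world.render(10.0, 15.0)    # []
look_back = world.render(0.0, 5.0)      # ['toy'] again: nothing was forgotten
```

Because every view is computed from the same map, a second viewer calling `render` with a different range sees a consistent world, which is the property the post attributes to the map-first design.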

Back when we were developing GEN3C, we often imagined a Holodeck-like future: a simulator where multiple agents can enter the same generated world, act independently, and learn to collaborate. Gamma-World makes this feel more concrete. It is a generative multi-agent world model that takes synchronized observations and actions, then rolls out what each agent will see next in the same evolving world, action-responsive at 24 FPS. For me, the key challenge is going beyond two players. As more agents enter, identity cannot be tied to fixed slots, interaction cannot rely on dense pairwise attention, and independent actions still need to resolve into one shared state. Two ideas make this work: 1. Simplex RoPE: distinct agent identities without slot bias (unique, but permutation-equivalent). 2. Sparse Hub Attention: agents communicate through learnable hubs instead of dense all-to-all attention (agent → hub → agent). This keeps cross-agent communication scalable. The exciting part: training on two-player data can generalize to four-player rollouts without additional training, and the same formulation extends to real-world bimanual robot coordination. A step toward populated world models: many agents, one shared world. Congrats to the team on Gamma-World! Project:

Xuanchi Ren

304,145 views • 2 months ago
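
The agent → hub → agent routing described in the post above can be sketched with two small attention passes. This is an illustrative simplification, not Gamma-World's implementation: the hubs here are fixed vectors rather than learned parameters, keys double as values, there are no projection matrices, and Simplex RoPE is not modeled. All function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head dot-product attention where keys and values are the same vectors."""
    scale = 1.0 / math.sqrt(len(memory[0]))
    outputs = []
    for q in queries:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, m)) for m in memory]
        )
        outputs.append(
            [sum(w * m[d] for w, m in zip(weights, memory)) for d in range(len(q))]
        )
    return outputs


def hub_attention(agents, hubs):
    """Route information agent -> hub -> agent instead of agent -> agent."""
    hub_summaries = attend(hubs, agents)   # each hub reads from all agents
    return attend(agents, hub_summaries)   # each agent reads from the hubs


def dense_scores(num_agents):
    """Attention scores computed by all-to-all agent attention."""
    return num_agents * num_agents


def hub_scores(num_agents, num_hubs):
    """Attention scores computed by the two hub passes."""
    return 2 * num_agents * num_hubs
```

With a fixed number of hubs the cost grows linearly in the number of agents: `dense_scores(64)` is 4096 while `hub_scores(64, 4)` is 512. The routing is also indifferent to agent order, since reordering the agents only reorders the outputs.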

🚨 BREAKING: NVIDIA just announced the Isaac GR00T Reference Humanoid Robot, the first fully open humanoid robot reference design built on Jetson Thor, and it's going straight to the world's top research institutions. This is Jensen Huang's bet on open physical AI infrastructure. The hardware stack is serious: → Unitree H2 Plus chassis, 6 feet tall, 150 pounds, 31 degrees of freedom → Sharpa Wave tactile five-finger hands, 22 degrees of freedom each, bringing the total to 75 across the full body → NVIDIA Jetson AGX Thor onboard compute, 2,070 FP4 teraflops of AI performance, 128GB unified memory → Multi-view sensing: stereo head camera, wrist cameras, IMU. Alongside this announcement, Unitree also introduced the H2 Plus as a standalone product, a frontier humanoid combining Unitree's own body, Sharpa's five-finger hands, and NVIDIA Jetson Thor compute into one fully integrated research platform. The full Isaac GR00T software stack ships with it: teleoperation for data capture, open foundation models, Isaac Sim for training, Isaac Lab for evaluation, and accelerated ROS middleware for deployment. It is the complete loop from data to real-world robot in one unified platform. ETH Zürich, Stanford Robotics Center, UC San Diego, and Ai2 are already on board as launch research partners. NVIDIA is now doing to robotics what it did to AI: build the platform, open the ecosystem, let the world build on top of it. Whoever owns the infrastructure layer wins. NVIDIA knows this better than anyone. 👀 Read more here: ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

16,062 views • 2 months ago

🧵 Understanding Zama: the future of privacy tech & homomorphic encryption 1️⃣ Zama is pioneering fully homomorphic encryption (FHE), a breakthrough that lets you compute on encrypted data without decrypting it. 🔐 That means total privacy: even the system running your data can’t see it. 2️⃣ Why it matters: Right now, cloud apps, AI models, and databases must access your raw data to work. FHE changes that: your data stays private while still usable. 3️⃣ Zama builds open-source FHE tools for developers, turning advanced cryptography into practical products for AI, blockchain, and Web3. 4️⃣ Imagine: • AI that learns without reading your secrets 🤖 • Blockchain transactions with zero data leaks • Cloud apps that never see your info 5️⃣ Zama’s mission: Privacy should be the default, not an option. They’re making privacy-preserving tech simple, scalable, and open for everyone. 🔚 In a world obsessed with data, Zama might just be building the encryption layer of the future internet. 🌐

v͙e͙s͙p͙e͙r͙ 📊🐐

19,644 views • 8 months ago
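
To make "compute on encrypted data without decrypting it" from the thread above concrete, here is a toy sketch using the Paillier cryptosystem. Note the swap: Paillier is only additively homomorphic, so it is a much weaker relative of the FHE schemes the thread describes, and this is not Zama's library or API. The primes are tiny and chosen purely for illustration, so the code is insecure by construction. It relies on Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random


def keygen(p=293, q=433):
    """Build a toy Paillier key pair from two small primes (insecure sizes)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1                   # standard simple choice of generator
    mu = pow(lam, -1, n)        # modular inverse; valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(public_key, message, rng=random):
    """Encrypt an integer 0 <= message < n with fresh randomness."""
    n, g = public_key
    n_squared = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(public_key, private_key, ciphertext):
    """Recover the plaintext integer from a ciphertext."""
    n, _g = public_key
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Add two plaintexts by multiplying their ciphertexts; no key needed."""
    n, _g = public_key
    return (c1 * c2) % (n * n)


public_key, private_key = keygen()
salary_a = encrypt(public_key, 20)
salary_b = encrypt(public_key, 22)

# A server holding only the public key can compute the encrypted sum.
encrypted_total = add_encrypted(public_key, salary_a, salary_b)
total = decrypt(public_key, private_key, encrypted_total)  # 42
```

The party calling `add_encrypted` never sees 20, 22, or 42. Fully homomorphic schemes extend this idea from addition alone to arbitrary computation.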

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia (Room S423/S424, Level 4) at 13:50 on Dec 15.

Michael Black

22,182 views • 7 months ago

The term "continual learning" has become overloaded if you see it as an ML problem. One classic thread is about memorization: regularization-based continual learning methods, such as EWC, MAS, and SI, estimate which parameters mattered for previous tasks and resist changing them too much. One modern thread is about adaptation: test-time training and inference-time learning methods, such as TTT, adapt part of the model on the incoming test stream before making predictions. These are sometimes discussed as separate threads. But in modern scalable architectures, I think they are better seen as complementary constraints: a model that learns quickly at test time also benefits from a mechanism for deciding what not to forget. In our #ECCV2026 paper, we study this in large-scale 4D reconstruction: how to build fast spatial memory that can adapt over long observation streams while reducing collapse and forgetting. Instead of using fully plastic test-time updates, we stabilize fast-weight adaptation with an elastic prior that balances adaptation and memory. Key ideas: - Elastic Test-Time Training: Fisher-weighted consolidation for fast-weight updates - EMA anchor weights that provide a moving reference for stability - Chunk-by-chunk inference for long 3D/4D observation streams We show that this scales across large 3D/4D pretraining settings, including both LRM-style and LVSM-style models, and improves reconstruction across benchmarks including Stereo4D, NVIDIA, and DL3DV-140. We release model checkpoints across different design choices: resolution, post-training curriculum, and whether the model uses an explicit 4DGS intermediate representation. - Homepage: - Paper: - Code: - Models: This work is co-led with Xueyang Yu, with contributions from Haoyu Zhen and Yuncong Yang, and advised by Michigan SLED Lab and Chuang Gan.

Martin Ziqiao Ma

33,411 views • 1 month ago

Everyone’s talking about “Physical AI”: the idea that we can simulate real-world environments so well that robots trained in simulation will work perfectly in reality. The promise: Train in virtual worlds → deploy anywhere. The reality: I’ve seen too many teams fall into this trap. After working with manipulation teams at Berkeley, Imperial, and Dyson, here’s the pattern: • Week 1: “Our policy works perfectly in simulation!” • Week 4: “Why doesn’t this work on real objects?” • Month 2: “We basically need to retrain from scratch with real data.” The gap simulations can’t bridge: Unlike blind locomotion policies that can get away with sim-to-real transfer because they rely mainly on proprioception and contact forces, vision-guided manipulation is extremely sensitive to visual domain gaps. • Real friction vs simulated surface textures • Manufacturing tolerances vs perfect CAD models • Dynamic lighting vs controlled virtual environments • Sensor noise vs instantaneous virtual readings Here's what people don't talk about: Building these detailed simulated environments takes forever. If it takes 7 days to build a simulated kitchen, wouldn't it be better to just collect real-world data in a real kitchen instead? Don't get me wrong: simulation is incredible for debugging, safety testing, and exploring edge cases. But it's not a magic solution to real-world deployment. What actually works: Use simulation strategically while making real-world data collection as efficient and flexible as possible. This is why Neuracore focuses on streamlined real-world data infrastructure. Because no amount of virtual training can replace understanding how your robot actually behaves in actual environments. The physics of your deployment environment can't be simulated away. What’s been your experience with sim-to-real transfer?

Stephen James

25,347 views • 10 months ago