Many 3D generators output Gaussian Splats (3DGS) for fast rendering, flexible deployment, and high visual fidelity. Static 3DGS aren't world models (no dynamics/semantics) but a true world model must allow distilling 3D-consistent representations for any given time step (3DGS/meshes). This post-distillation serves a dual purpose: 1) validates physical consistency...

Matthias Niessner

48,694 subscribers

26,416 views • 4 months ago • via X (Twitter)

Related Videos

🧊 Splats → Voxels: Bridging 3D Gaussian Splatting with Dynamic Voxels World Labs
Just added this last tool to the GS editor plug-in that can convert any 3DGS splats into interactive voxel worlds with full physics support!
🚀 3 Rendering Modes:
🔵 CPU Mode - GameObjects for each voxel. Max physics interaction, perfect for destructible environments
🟢 GPU Mode - DrawMeshInstancedIndirect. Renders 100K+ voxels at 90fps. Pure visual, no physics overhead
🟡 Hybrid Mode - Best of both! GPU renders distant voxels
Building this so you can literally punch through a GS World and watch it crumble 🥊
#GaussianSplatting #unity3d #GameDev #VR #3DGS #Voxels #Physics #madewithunity

Daniel Skaale

28,534 views • 7 months ago
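
The splat-to-voxel conversion in the post above can be pictured with a small sketch. This is not the plug-in's code, which the post does not show. It only assumes that each splat has a center, a color, and an opacity, and that conversion means bucketing centers into a regular grid. The function name `voxelize_splats` and its parameters `voxel_size` and `min_opacity` are hypothetical.

```python
from collections import defaultdict
from math import floor


def voxelize_splats(splats, voxel_size=0.5, min_opacity=0.1):
    """Bucket Gaussian centers into a sparse voxel grid (illustrative only).

    splats: iterable of (center_xyz, color_rgb, opacity) tuples.
    Returns {voxel_index: rgb}, where rgb is the opacity-weighted mean color
    of all splats whose center falls inside that voxel.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])  # r, g, b sums, total weight
    for center, color, opacity in splats:
        if opacity <= 0.0 or opacity < min_opacity:
            continue  # near-transparent splats should not create voxels
        # floor() keeps negative coordinates in the correct cell
        index = tuple(floor(c / voxel_size) for c in center)
        cell = sums[index]
        for channel in range(3):
            cell[channel] += color[channel] * opacity
        cell[3] += opacity
    return {
        index: tuple(cell[ch] / cell[3] for ch in range(3))
        for index, cell in sums.items()
    }


# Hypothetical input: two overlapping splats, one separate, one nearly invisible.
splats = [
    ((0.1, 0.1, 0.1), (1.0, 0.0, 0.0), 1.0),
    ((0.2, 0.3, 0.4), (0.0, 0.0, 1.0), 1.0),
    ((1.2, 0.0, 0.0), (0.0, 1.0, 0.0), 0.5),
    ((5.0, 5.0, 5.0), (1.0, 1.0, 1.0), 0.01),
]
grid = voxelize_splats(splats)
print(len(grid), "occupied voxels")
```

A sparse dictionary keeps only occupied cells, which matters when most of the scene volume is empty. Each occupied cell could then be handed to whichever rendering mode applies, a per-voxel object or an instanced draw.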

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

📢 Our lab has been exploring 3D world models for years — and we’re thrilled to share PhysTwin: a milestone that reconstructs object appearance, geometry, and dynamics from just a few seconds of interaction! Led by the amazing Hanxiao Jiang
👉 PhysTwin combines Gaussian splatting with inverse dynamics optimization based on simple spring-mass systems.
⚙️ The result? Real-time, action-conditioned 3D video prediction under novel interactions (i.e., 3D world models).
🔑 A few key takeaways:
1. Having the right structure (e.g., particles/masses) helps navigate the trade-off between sample efficiency, generalization, and broad applicability.
2. Visual foundation models (VFMs) have matured to the point where they can provide rich supervision for world modeling (e.g., tracking, shape completion).
3. Beyond VFMs, many crucial components have come together in recent years: Gaussian splats for rendering, NVIDIA Warp for high-performance simulation, and scene/asset generation from a wide range of labs and companies. The future of 3D world models is looking bright! ✨
4. The resulting digital twin supports a wide range of downstream applications—especially in data generation and policy evaluation, thanks to its realistic rendering and simulation capabilities.
🎥 All code and data to reproduce the results, along with interactive demos, are available on the website. Check the following visualizations of: (1) observations, (2) reconstructed state/actions, (3) interactive digital twins, and (4) the overlays between real-world robot teleoperation and our model’s open-loop predictions.

Yunzhu Li

25,279 views • 1 year ago
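
For readers unfamiliar with the spring-mass systems mentioned in the PhysTwin post above, here is a minimal sketch of one forward simulation step. It is a generic textbook integrator (semi-implicit Euler with Hooke's-law springs), not PhysTwin's implementation. It does not use NVIDIA Warp and does not include the inverse optimization that fits spring parameters to video. The function name `spring_mass_step` and all of its parameters are hypothetical.

```python
import math


def spring_mass_step(positions, velocities, springs, dt=0.01, mass=1.0,
                     damping=0.0, gravity=(0.0, 0.0, 0.0)):
    """Advance a 3D spring-mass system by one semi-implicit Euler step.

    positions, velocities: lists of [x, y, z] per particle.
    springs: list of (i, j, rest_length, stiffness) tuples.
    Returns (new_positions, new_velocities); the inputs are not modified.
    """
    # Start with gravity acting on every particle.
    forces = [[mass * g for g in gravity] for _ in positions]
    for i, j, rest_length, stiffness in springs:
        delta = [b - a for a, b in zip(positions[i], positions[j])]
        length = math.sqrt(sum(d * d for d in delta))
        if length == 0.0:
            continue  # direction undefined for coincident particles
        # Hooke's law: pull together when stretched, push apart when compressed.
        magnitude = stiffness * (length - rest_length)
        for axis in range(3):
            f = magnitude * delta[axis] / length
            forces[i][axis] += f
            forces[j][axis] -= f
    new_positions, new_velocities = [], []
    for p, v, f in zip(positions, velocities, forces):
        # Update velocity first, then move with the new velocity.
        nv = [(vk + dt * fk / mass) * (1.0 - damping) for vk, fk in zip(v, f)]
        new_velocities.append(nv)
        new_positions.append([pk + dt * nvk for pk, nvk in zip(p, nv)])
    return new_positions, new_velocities


# Two particles joined by a spring stretched to twice its rest length.
pos = [[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
pos, vel = spring_mass_step(pos, vel, [(0, 1, 1.0, 10.0)], dt=0.1)
print(pos, vel)
```

In an inverse setting, quantities such as `stiffness` and `rest_length` would be the unknowns, adjusted until simulated particle motion matches what was observed.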

Wonderland: Navigating 3D Scenes from a Single Image
Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks—and bridging image space and 3D space—through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,849 views • 1 year ago

Apple just trained a 3D Gaussian head reconstruction model on 10,000+ subjects. Feed-forward. No test-time optimization. New identity in, reconstructed Gaussian head out. The UV-parameterized Gaussian representation decouples the number of Gaussians from the number and resolution of input images, making it practical to train with many high resolution views. And the heads are not just static either: text-conditioned identity generation, plus blendshape-driven latent animation across identities. We've been building in the 3D Gaussian Splatting space for a while. The gap between "research demo" and "works on real people at scale" is closing fast.

KIRI Engine - 3D Scanner App

12,181 views • 2 months ago

Imagine making 2D concept art for a game world –pressing a button – and suddenly you can walk around an interactive 3D world. That's what Google DeepMind's new paper Genie 2 can do – simulate virtual worlds, including the consequences of any action (e.g. unlock door, jump, swim etc). Right now Genie 2 can generate consistent worlds for up to a minute. And this world model seems to generate larger 3D worlds than what World Labs showcased yesterday. Plus they're dynamic vs. static worlds – the foliage moves in the wind, the water ripples etc. Not quite ready for prime time, but promising on two fronts: 1. For game developers: enabling rapid prototyping of interactive experiences straight from concept art 2. For AI research: providing unlimited, diverse 3D environments for training and testing AI agents The race for building the biggest, baddest world model is very much on. Meanwhile, all I can think is "if only Stadia was still around!"

Bilawal Sidhu

71,326 views • 1 year ago

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Michael Black

22,182 views • 7 months ago

AI Is Moving Beyond “Generating Videos” — Toward “Generating Worlds”
Over the past two years, AI video models have advanced at an astonishing pace. From Runway and Pika to Sora and Veo, AI-generated videos have become increasingly realistic and more consistent with the physical laws of the real world. Many people believe the next objective is simply to generate videos that are longer, sharper, and more lifelike. But if we take a step back, we can see that the real transformation is not happening in video itself. It is happening in world models.
What Is a World Model?
In 1943, psychologist Kenneth Craik proposed an idea that would influence artificial intelligence research for decades. He argued that the human brain does not merely react to the outside world. Instead, it maintains an internal model of how the world works. Because we have this internal model, we can predict the outcome of an action before we actually take it. Before crossing a road, we estimate whether a car will pass by. Before catching a ball, we predict its trajectory. These abilities come from continuously simulating the world in our minds, rather than relying entirely on trial and error. This idea later became known by a more formal term: World Model. A world model does not describe a single image or a fixed video clip. It is an internal representation capable of continuously simulating the rules and dynamics of the real world.
Why Is AI Research Turning Toward World Models?
Because predicting “what comes next” is becoming increasingly central to how AI systems work. Language models predict the next token. Image models predict the next step in the denoising process. Video models predict the next frame. A world model, however, attempts to predict something broader: What should the world look like in the next moment?
In 2018, David Ha and Jürgen Schmidhuber proposed in their paper World Models that an intelligent agent could first learn a model of the world, and then use that internal model to plan its actions. The Dreamer series later demonstrated that many complex tasks could be learned by training agents inside an “imagined world.” At the same time, the development of video models such as Sora and Veo led researchers to another realization: A model capable of continuously generating video has already learned, at least implicitly, many of the rules governing the real world. As a result, these two research directions have gradually begun to converge.
But Video Is Not Yet a World
This is where the distinction is often misunderstood. For a world model to support meaningful real-time interaction, it must solve several critical problems. Most video models today are essentially answering one question: What should the next frame look like? A true world model needs to answer much more: What happens if I take one step forward? If I walk behind a building and then return, will the building still be there? If I suddenly change the camera angle, will the entire space remain consistent? If I enter a command such as “Summon a dragon,” will the world respond immediately? In other words, a world model must do more than generate content. It must understand space. It must understand time. It must understand causality. And it must understand interaction. Moving from watching to participating is where the real difficulty of world models begins.
World Models Are Entering the Interactive Era
One of the latest attempts in this direction is Alaya World, recently open-sourced by Alaya World, or Alaya Lab. Instead of generating a fixed video clip, it generates a world that users can explore in real time. Users can begin with text, an image, or a video, enter the generated scene, move freely through it, and introduce new prompts at any moment during generation. The world responds immediately.
According to the publicly released information, Alaya World provides:
- Real-time streaming generation at 720p and 24 FPS
- Stable continuous exploration for more than one minute
- The ability to switch prompts and trigger skills or events during generation
- Model weights and inference code released under the Apache 2.0 License
- Training code and datasets planned for future release
What makes these capabilities important is not simply the technical specifications. It is that the generated “world” can now support continuous interaction. The official demo shows that users can genuinely control, transform, and explore the generated environment.
AI Is Evolving From a Tool Into an Environment
Over the past few years, most discussions around AI have focused on content generation. Generating text. Generating images. Generating videos. But world models raise a fundamentally different question: Can AI generate an environment that people can inhabit, explore, and continuously evolve? If the answer is yes, the impact will extend far beyond video generation. Game development, robotics training, embodied intelligence, digital twins, virtual production, and many other fields could be transformed by the development of world models. World models are still at a very early stage. Yet from Craik’s proposal of an internal mental model more than eighty years ago to the emergence of today’s interactive world-generation systems, a clear evolutionary path is beginning to take shape. Perhaps what AI is ultimately learning has never been limited to images, videos, or language. Perhaps it is learning the world itself.
References
GitHub:
Technical Report:

雪踏乌云

112,114 views • 17 days ago

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]
So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years.
With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model, but it still looks hyper clunky and deformed, but it works!
One cool idea I had to make that more useful and made now: let people make a 3d model, then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d.
That image you can then [ Remix ] (img2img), and it becomes a real photo again, and that in turn you can then turn into a video again with [ Make video ].
So that essentially gives you a fully freeform camera position control to take photos with.
One thing I need to fix is the background/skybox: I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer. In this case it should be white, but it's a start!

@levelsio

119,210 views • 1 year ago

ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies
Contributions:
1) We propose ImmerseGen, a novel agent-guided 3D environment generation framework. It uses simplified geometric proxies with alpha-textured meshes to produce compact, photorealistic worlds ready for real-time mobile VR rendering.
2) We propose a novel RGBA texturing paradigm. It first synthesizes 8K terrain textures using a geometry-conditioned panorama generator via user-centric mapping, and then directly generates alpha-textured proxy assets, avoiding fidelity loss typically resulting from mesh decimation.
3) To automate scene creation from user prompts, we introduce VLM-based modeling agents equipped with a novel grid-based semantic analysis. This enables 3D spatial reasoning from 2D observations and ensures accurate asset placement. ImmerseGen further enhances immersion with dynamic effects and ambient audio for a multisensory experience.
4) Experiments on multiple scene-generation scenarios and live mobile VR applications show that ImmerseGen outperforms previous methods in visual quality, realism, spatial coherence, and rendering efficiency for immersive real-time VR experiences.

MrNeRF

14,225 views • 1 year ago

Introducing Kaleido💮 from AI at Meta — a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea, we formulate rendering purely as a sequence-to-sequence generation problem, successfully unifying neural rendering with the architecture principles behind modern language and video models. Unlike traditional neural rendering methods, Kaleido learns 3D purely in a data-driven way, without explicit 3D representations or structures. It acquires spatial understanding directly through large-scale video pretraining, then multi-view 3D data finetuning, inspired by how LLMs acquire textual common sense from large corpora before specialising in domains like coding. Through extensive ablations, we progressively modernised the architecture design and training strategies and tackled key scaling challenges in sequence-to-sequence generative rendering, arriving at a design that’s simple, versatile, and scalable. Kaleido significantly outperforms prior generative models in few-view settings, and remarkably is the first zero-shot generative method that matches InstantNGP-level rendering quality in multi-view settings. We view Kaleido also as an alternative step towards world modeling that flexibly spans a spectrum of “realities”: with many views, it faithfully reconstructs grounded reality; with fewer views, it imagines plausible unseen details. 🔗 Explore more results and paper:

Shikun Liu

22,389 views • 10 months ago

✨ From Mesh to 3D Gaussians - Shaded in Deferred Style ✨ By Stefano Scolari (with link to repo in the comment)

When converting a mesh to a 3DGS model, the Gaussians are embedded with PBR properties and normal info during conversion. This prepares them for shading, similar to traditional geometry. In this scenario, Stefano used a deferred approach for shading to demonstrate how seamlessly it translates from triangle-based pipelines.

🔥 Fun fact: The mesh-to-splat conversion only needs to occur once, but it's so fast you could actually run it every frame.

Here’s the full pipeline:
- Mesh-to-splat conversion pass
- Depth pre-pass
- Sort
- Gaussian pre-pass (projective computations, etc.)
- GBuffer rendering pass
- Shadow pass
- Deferred lighting pass

This is the same classic deferred pipeline, but with Gaussians instead of triangles. Real-time relighting of converted assets? Totally possible.

MrNeRF

19,768 views • 1 year ago

Self-Calibrating Gaussian Splatting for Large Field of View Reconstruction Note: Check below for full video. Abstract (cited): "In this paper, we present a self-calibrating framework that jointly optimizes camera parameters, lens distortion, and 3D Gaussian representations, enabling accurate and efficient scene reconstruction. Our technique is particularly effective for high-quality scene reconstruction from large field-of-view (FOV) imagery taken with wide-angle lenses, allowing the scene to be modeled from a smaller number of images. We introduce a novel method for modeling complex lens distortions using a hybrid network that combines invertible residual networks with explicit grids. This design effectively regularizes the optimization process, achieving greater accuracy than conventional camera models. Additionally, we propose a cubemap-based resampling strategy to support large FOV images without sacrificing resolution or introducing distortion artifacts. Our method is compatible with the fast rasterization of Gaussian Splatting, adaptable to a wide variety of camera lens distortions, and demonstrates state-of-the-art performance on both synthetic and real-world datasets."
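
The cubemap idea mentioned in the abstract can be illustrated with a small sketch: each cube face covers only a 90° field of view, so a ray far off the optical axis still lands near the middle of some face instead of being stretched toward the edge of a single image plane. The function below is a generic direction-to-face lookup, not the paper's resampling code; the function name, the face labels, and the UV orientation are assumptions chosen for this example.

```python
import math


def direction_to_cubemap(x, y, z):
    """Map a non-zero 3D direction to (face, u, v) with u, v in [0, 1].

    Face labels and UV orientation are conventions of this sketch,
    not taken from the paper.
    """
    ax, ay, az = abs(x), abs(y), abs(z)
    if ax == 0 and ay == 0 and az == 0:
        raise ValueError("direction must be non-zero")
    # Pick the face whose axis has the largest absolute component.
    if ax >= ay and ax >= az:
        face = "+x" if x > 0 else "-x"
        major, a, b = ax, (-z if x > 0 else z), y
    elif ay >= az:
        face = "+y" if y > 0 else "-y"
        major, a, b = ay, x, (-z if y > 0 else z)
    else:
        face = "+z" if z > 0 else "-z"
        major, a, b = az, (x if z > 0 else -x), y
    # Project onto the face plane, then remap [-1, 1] to [0, 1].
    u = 0.5 * (a / major + 1.0)
    v = 0.5 * (b / major + 1.0)
    return face, u, v


# A ray 80 degrees away from the +z axis falls on the +x face.
wide_ray = (math.sin(math.radians(80)), 0.0, math.cos(math.radians(80)))
wide_face, wide_u, wide_v = direction_to_cubemap(*wide_ray)
center = direction_to_cubemap(0.0, 0.0, 1.0)
```

Here `center` is the middle of the `+z` face, while `wide_ray` is handled by the `+x` face rather than by an extreme corner of the forward-facing image.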

MrNeRF

17,206 views • 1 year ago

1/ World models are getting popular in robotics 🤖✨ But there’s a big problem: most are slow and break physical consistency over long horizons.

2/ Today we’re releasing Interactive World Simulator: An action-conditioned world model that supports stable long-horizon interaction.

3/ Key result:
✅ 10+ minutes of interactive prediction
✅ 15 FPS
✅ on a single RTX 4090🔥

4/ Why this matters: it unlocks two critical robotics applications:
🚀 Scalable data generation for policy training
🧪 Faithful policy evaluation

5/ You can play with our world model NOW at NO git clone, NO pip install, NO python. Just click and play!

NOTE ⚠️ ALL videos here are generated purely by our model in pixel space! They are NOT from a real camera

More details coming 👇 (1/9) #Robotics #AI #MachineLearning #WorldModels #RobotLearning #ImitationLearning
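
To put the key result in perspective, the quoted figures can be turned into a frame count. This is plain arithmetic on the numbers in the post; the assumption that every displayed frame is one prediction step of the model is ours, not something the post states.

```python
# Figures quoted in the post: at least 10 minutes of interaction at 15 FPS.
minutes = 10
fps = 15

# Assumption for this sketch: one model prediction step per displayed frame.
min_frames = minutes * 60 * fps
print(min_frames)  # 9000
```

Under that assumption, "10+ minutes" means the model stays consistent over at least 9,000 consecutive predicted frames.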

Yixuan Wang

128,746 views • 5 months ago

This week is already so hot. 🔥 Massive release from Decart: Lucy 2.0, a World Editing Model running at 1080p, 30FPS in real time. This is truly exciting; the era of real-time generative reality is here. We are moving from watching AI video to living inside AI video.

Lucy 2.0 is a breakthrough model capable of transforming the visual world in real time. Moving beyond offline rendering, it delivers high-fidelity 1080p video generation with near-zero latency. Lucy 2.0 literally "redraws" the entire world pixel by pixel while you are watching it. For example, if you want to be an anime character, it doesn't just put a mask on you. It turns your skin into anime skin, your hair into anime hair, and the lighting in your room into anime lighting. Lucy 2.0 is also trained to stop the generated video from slowly falling apart over time, so the same stream can run much longer without faces and details drifting.

So why is this a "Massive Deal"? With a traditional AI video-generation model, you enter a prompt, wait 10–20 minutes, and the computer "bakes" a video for you. You can't touch it or change it while it is being generated. Lucy 2.0 works like a mirror. It runs in real time (30 frames per second). There is no waiting. You move your hand, and the AI character moves its hand instantly.

The craziest part isn't the visuals; it's the physics. Usually, AI hallucinations are glitchy: hands merge into faces, walls melt. Lucy 2.0 understands how the world works without being told. It knows that if you take off a helmet, there is hair underneath. It knows that if you splash water, droplets fly. It learned "physics" just by watching millions of videos. The physical behavior you see emerges from learned visual dynamics, not from engineered geometry or explicit physics engines. Their official technical report explicitly states that the model does not use traditional 3D engines, depth maps, or wireframes. It is a "pure diffusion model."

Rohan Paul

12,761 views • 6 months ago

Two weeks ago I fixed one of my teeth with algorithms I wrote a couple of years ago!

I got hooked by 3D scanning when I started to work for a software shop in Zurich that was programming 3D computational geometry algorithms for denture scanning to produce crowns (and more). Back then, a typical reconstruction pipeline was like: scan the patient’s teeth using an intraoral scanner, reconstruct the surface mesh, design the restoration digitally, and finally mill the crown out of ceramic. We were working mostly with point clouds and meshes, but it wasn’t just math, it was craftsmanship translated into a digital process. Every micron mattered. You could literally see how a good algorithm meant a better fit in someone’s mouth.

Gaussian Splatting isn’t about surface reconstruction, it’s about appearance reconstruction. It doesn’t care about explicit topology, it captures how light interacts with the scene. In a sense, it’s the opposite philosophy of the dental world: instead of modeling what the object is, it models how the object looks. 3D Gaussian Splatting enables applications like training self-driving cars, teaching robots to understand their environment, creating virtual worlds, or monitoring real sites. It represents scenes as millions of small Gaussians rendered in real time without the need for meshes or textures.

Coming from a world where precision geometry was everything, this shift felt natural. It’s still about reconstruction, but with a different goal: not manufacturing a perfect object, but reproducing how the world actually looks.

Two weeks ago I got my first dental crown, made with the same software, reconstruction algorithms, and Swiss precision I once helped develop. I haven’t worked there in two years, but sitting in that chair and seeing the process from the other side was a proud moment. It reminded me why I love this field.

MrNeRF

290,140 views • 9 months ago

Robotics keeps hitting the same wall. Single task RL works, but... it does not scale to hundreds of tasks or new embodiments. This new paper looks like a real step toward fixing that.

The team introduces MMBench, a benchmark with 200 tasks across many domains and robots, and Newt, a language conditioned world model trained online across all 200 tasks at once.

The simple idea behind Newt:
- The model learns from demos to get the right priors
- It trains across many tasks through online interaction
- It uses language to ground the goal
- It adapts fast when a new task shows up

What stood out to me:
✅ One model trained on 200 tasks at the same time
✅ Language conditioned control for both states and RGB
✅ Better data efficiency than strong baselines
✅ Strong open loop control
✅ Fast adaptation to new tasks and embodiments
✅ Full release of 200 checkpoints, 4000 demos, code, and benchmark

This is a good push toward general control instead of one model per task.

If you want the full paper: Project page:

Weekly robotics and AI insights. Subscribe free:

Ilir Aliu

70,090 views • 8 months ago

The architecture of this new world model is one of the most interesting things I've seen lately: Let me first explain how most world models work: They predict and render one frame at a time. If you are navigating in one of these worlds, and you look left, the model draws whatever looks right in the moment. Every time you change your viewpoint, the model has to imagine what should be there again, so it's very common for these models to "forget" what's in the world. For example, if you put a toy on the table, look away, then look back, the toy might not be there anymore. Tripo AI is releasing its Project Eden model, which works very differently: The model builds the world first, and then renders it based on that map. That map holds the real state of the world: the geometry, every object, where things are, what's already happened. The picture you see on screen gets generated from the map. This architecture flips the whole thing. Now, you get the following: 1. The world stops forgetting. Leave, come back, and the toy is still on the table because it lives in the map, not in the last frame you saw. 2. You can edit the world, and those changes persist for anyone who enters later. 3. Multiple people and AI agents can coexist in the world and see it from different perspectives. This is early research, but it's looking really promising. They just raised nearly $200M across two rounds to build it out. Tripo will be at SIGGRAPH 2026 (July 19–23, Los Angeles Convention Center). If you work in 3D, embodied AI, simulation, or anything spatial, go connect with them there.
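
The difference between the two designs can be shown with a toy example. The sketch below is only an illustration of the "build the map first, render from the map" idea described above; it is not Tripo's architecture, and `WorldMap`, `place`, and `render` are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class WorldMap:
    """Persistent scene state: object name -> (x, y) position."""
    objects: dict = field(default_factory=dict)

    def place(self, name, position):
        # Edits go into the map, so they outlive any single rendered frame.
        self.objects[name] = position

    def render(self, view_min_x, view_max_x):
        """Return the names visible in a 1D view window, read from the map."""
        return sorted(
            name
            for name, (x, _) in self.objects.items()
            if view_min_x <= x <= view_max_x
        )


world = WorldMap()
world.place("toy", (2.0, 0.5))

looking_at_table = world.render(0.0, 5.0)   # the toy is in view
looking_away = world.render(10.0, 15.0)     # nothing in view
looking_back = world.render(0.0, 5.0)       # the toy is still there
```

Because every view is derived from the same stored state, looking away and back cannot make the toy disappear, and a second viewer calling `render` with a different window would see the same world from another perspective.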

Santiago

30,189 views • 1 month ago

Depth Any Video with Scalable Synthetic Data AI physicists and chemists continue to make strides in depth estimation from video. Check out this new paper featuring some impressive examples. See the thread for more details (unfortunately no code yet). Abstract: Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse game environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.

MrNeRF

27,428 views • 1 year ago

Yesterday at Brown University ICERM's workshop on “Agentic Scientific Computing and Scientific Machine Learning” I spoke about “Adaptive Swarms Across Scales”, making the case for scientific AI as systems that can create representations, stress them, fracture them, and enlarge the category in which future representations live. The category here is a composable and breakable working universe of science: data, hypotheses, simulations, measurements, tools, failures, figures, papers, provenance, and the transformations that connect them. Discovery happens when those transformations become executable, inspectable, composable, and capable of changing the world model they operate within. Atomistic modeling gives one category - states, forces, trajectories, observables, boundary conditions, conservation laws. Neural surrogates learn fast morphisms inside or between such categories. But discovery is higher-order: it changes which objects and morphisms are available in the first place: what variables exist, what operations are allowed, what evidence counts, what scale is active, what invariant is being preserved, and what kind of explanation the system is even capable of forming. This is scientific method as adaptive architecture: compression, stress, fracture, recomposition. Fracture matters here because it makes the logic physical: a non-commuting diagram realized in matter. The imposed load, material hierarchy, defect field, and assumed continuum description no longer map cleanly into the observed outcome. The crack is the obstruction and it identifies where the old morphism failed and where a new representation must be introduced. The physical crack and the categorical obstruction are the same event viewed in different substrates. ScienceClaw × Infinite is a machine for constructing and transforming a category of scientific artifacts. Each artifact is typed. Each operation has lineage. Each failed branch remains in the category as reusable structure. 
The “paper” is no longer the terminal object of science; it is one projection of a larger compositional trace, and it can be generated at any time for consumption by a human or an AI. With that the unit of scientific labor is changing. For most of the twentieth century the unit was the result (a measurement, a theorem, a synthesized molecule). It is now becoming the algorithm that produces results, and after that, the substrate of discovery itself. The static PDF is the wrong terminal object for this regime, and the role of the scientist with it. We now design algorithms that build algorithms, and eventually substrates in which such algorithms compose themselves. At that point, the scientist is no longer outside the discovery system. The scientist becomes one of the representations the system can transform. In that sense, the systems will eventually do science to us, and that is the structural consequence of the principle they are built on.

Markus J. Buehler

10,095 views • 2 months ago