Diffusion Transformers aren't just generative models, but also powerful multi-modal encoders. ConceptAttention creates rich heatmaps of text concepts in images from DiT representations. This even works on real images, and can be applied to tasks like segmentation! Demo 👇
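
As a rough illustration of how a concept heatmap can be read off from transformer representations, here is a minimal sketch. It assumes every image patch and every text concept already has an embedding vector in a shared space, scores each patch against each concept with a dot product, and normalizes across concepts with a softmax. The function name `concept_heatmaps`, the toy vectors, and the grid size are hypothetical; this is not the ConceptAttention implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def concept_heatmaps(patch_vecs, concept_vecs, grid_w):
    """Return {concept: 2D heatmap} from patch/concept dot products.

    patch_vecs: patch embeddings in row-major grid order (assumed given).
    concept_vecs: dict mapping concept name -> embedding of the same length.
    grid_w: number of patches per grid row.
    """
    names = list(concept_vecs)
    flat = {name: [] for name in names}
    for p in patch_vecs:
        # Similarity of this patch to every concept.
        scores = [sum(a * b for a, b in zip(p, concept_vecs[n])) for n in names]
        # Softmax across concepts, so each patch distributes weight 1.0.
        for n, prob in zip(names, softmax(scores)):
            flat[n].append(prob)
    # Reshape each flat list of per-patch scores into grid rows.
    return {n: [v[i:i + grid_w] for i in range(0, len(v), grid_w)]
            for n, v in flat.items()}

# Toy 2x2 "image" (hypothetical values): top row resembles "sky", bottom row "grass".
patches = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
concepts = {"sky": [4.0, 0.0], "grass": [0.0, 4.0]}
maps = concept_heatmaps(patches, concepts, grid_w=2)
```

Thresholding one of these maps, or taking the highest-scoring concept per patch, gives a coarse segmentation, which is the kind of downstream use the post mentions.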

Alec Helbling

10,787 subscribers

24,419 views • 1 year ago • via X (Twitter)

Science & Technology Education

Related Videos

The Hidden Language of Diffusion Models paper page: tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness" combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation
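
The pseudo-token idea described above, a sparse weighted combination of vocabulary tokens, can be sketched in a few lines. This is only an illustration under stated assumptions: the embeddings and coefficients are made-up toy values, sparsity is imposed by simply keeping the `k` largest coefficients, and the learning step (optimizing the coefficients to reconstruct generated images) is left out entirely. The name `sparse_pseudo_token` is hypothetical.

```python
def sparse_pseudo_token(vocab_embeddings, weights, k):
    """Combine the k highest-weight vocabulary embeddings into one vector.

    vocab_embeddings: dict token -> embedding (list of floats), assumed given.
    weights: dict token -> coefficient (hypothetical, already learned).
    k: number of tokens kept; everything else is dropped for sparsity.
    """
    # Keep only the k tokens with the largest coefficients.
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    dim = len(next(iter(vocab_embeddings.values())))
    pseudo = [0.0] * dim
    # Weighted sum of the surviving token embeddings.
    for tok in top:
        for i, x in enumerate(vocab_embeddings[tok]):
            pseudo[i] += weights[tok] * x
    return top, pseudo

# Toy vocabulary with one-hot embeddings (hypothetical values).
vocab = {
    "family": [1.0, 0.0, 0.0],
    "laughter": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
coeffs = {"family": 0.6, "laughter": 0.4, "car": 0.01}
tokens, vec = sparse_pseudo_token(vocab, coeffs, k=2)
```

The list of surviving tokens is what makes the decomposition interpretable: it names the few vocabulary entries that make up the concept.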

AK

41,746 views • 3 years ago

JUST IN: Meta AI introduces Voicebox, an all-in-one generative speech model. Voicebox is an impressive breakthrough! It could do for speech what other models like GPT-3 and Stable Diffusion have done for text and images. Some key details:
- Voicebox can synthesize speech across 6 languages
- It's a general-purpose model that can perform tasks it wasn't trained on. It can perform noise removal, content editing, style conversion, and more
- Supports in-context text-to-speech synthesis and cross-lingual style transfer
- It's 20x faster than current models and outperforms single-purpose models through in-context learning
paper: blog:

elvis

88,512 views • 3 years ago

MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers paper page: Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging the deployment in practical applications. This study addresses this challenge by breaking down the text-based video editing process into two separate stages. In the first stage, we leverage an existing text-to-image diffusion model to simultaneously edit a few keyframes without additional fine-tuning. In the second stage, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the keyframes, benefiting from structural guidance provided by intermediate frames. Our comprehensive set of experiments illustrates the efficacy and efficiency of MaskINT when compared to other diffusion-based methodologies. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain.

AK

25,449 views • 2 years ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

kling 3.0 is crazy... this model works very differently than any other model. this is from giving it just 2 images and a multi prompt, then it handled all the scenes by itself... now imagine what it could do if you gave it more references and an even more detailed multi prompt. going to test this a lot more, but kling definitely cooked on this one

Miko

95,650 views • 5 months ago

2023 was the year of AI avatars. 2024 was the year of AI photos. 2025 was the year of AI videos. And I think it's becoming clear now that 2026 will be the year of AI world models: fully interactive explorable 3d worlds generated from one or multiple 2d images or a prompt. In turn, these 2d images can then be generated by AI too. So soon you can generate fully explorable virtual 3d worlds based on your own imagination. Next will be figuring out how to make those worlds interactive. This is World Labs (unaffiliated, but I like it). As always, a lot of big AI model companies are now working on the same thing: 3d world models. Only World Labs has a real properly working demo (for now). Very exciting time again!

@levelsio

584,228 views • 10 months ago

Pothole detection on the road in real time using Ultralytics YOLO26! 🕳️ Manual road inspections are slow, costly, and hard to scale. With object detection, potholes can be identified directly from street-level images or video feeds, enabling faster and more consistent road condition monitoring. How I built this demo: ✅ Trained a segmentation model on a custom dataset. ✅ Generated mask contours for each pothole. ✅ Leveraged the onnx-exported model for faster processing. #Pothole #RoadDamage #AI
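
To make the "mask contours" step concrete, here is a small sketch of contour extraction from a binary segmentation mask. It does not reproduce the demo's pipeline (which, per the post, used an Ultralytics segmentation model exported to ONNX); it assumes a mask is already available as a 2D list of 0/1 values and uses a simple 4-neighbour boundary test. The function name `mask_contour` and the toy mask are hypothetical.

```python
def mask_contour(mask):
    """Return boundary pixels of a binary mask as (row, col), in row-major order.

    A foreground pixel counts as boundary if any of its 4 neighbours is
    background or lies outside the image.
    """
    h, w = len(mask), len(mask[0])
    contour = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue  # background pixels are never on the contour
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside = not (0 <= nr < h and 0 <= nc < w)
                if outside or not mask[nr][nc]:
                    contour.append((r, c))
                    break
    return contour

# Toy 5x5 mask with a 3x3 "pothole" in the middle (hypothetical data).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
outline = mask_contour(mask)
```

In practice a library routine such as OpenCV's `findContours` would normally do this job; the hand-rolled version only shows what the step computes.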

Muhammad Rizwan Munawar

30,677 views • 4 months ago

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:

AK

154,511 views • 3 years ago

Excited to share a few presentations, demos, and workshop talks from our group and collaborators at #ICRA2026! We will present recent work on real-to-sim-to-real robot policy evaluation, model-based planning with learned dynamics, and multi-modal manipulation. We will also have a joint live demo between SceniX and Analog Devices, Inc. on real-to-sim-to-real cable manipulation at the ICRA exhibition. This is a small teaser of what we have been building, with more to come soon! If you are at ICRA, please stop by the sessions or the demo booth. Happy to chat about robot learning, simulation, world models, and sim-to-real!

Yunzhu Li

10,855 views • 1 month ago

Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.

Michael Black

22,182 views • 7 months ago

This AI just turned me into a film director… No editing skills. No timeline headaches. Just one prompt. This is Seedance 2.0 🎬 You can literally combine: → Text → Images → Videos → Audio And it understands everything. Even crazier? You can control it like this: Image → character Video → camera movement audio1 → music/voice It doesn’t just generate clips… It builds full cinematic scenes with: → Consistent characters → Smooth transitions → Realistic motion → Built-in lip sync Basically… From a single prompt → you get a multi-shot story. Not AI video. AI filmmaking. Go try it before everyone catches on 👇

Kshitij Mishra | AI & Tech

60,393 views • 3 months ago

Microsoft presents Windows Agent Arena Evaluating Multi-Modal OS Agents at Scale discuss: Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

AK

19,684 views • 1 year ago

OpenAI's Deep Research is getting a run for its money. Deep Lake was just released, and it's a different take on an AI system that can do deep research on your own data. You can use Deep Lake to build AI search with reasoning on your private and public data. (Look at the attached videos to get an idea of how it works.) If you want to research proprietary and sensitive data, Deep Research won't help you because it's limited to public data. Deep Lake, however, will allow you to use your private data. On top of that, Deep Lake supports multi-modal retrieval from the ground up. It uses vision language models for data ingestion and retrieval so that you can connect any data (PDFs, images, videos, structured data, etc.) You can even use mixed-data queries! Deep Lake can search your data from S3, Dropbox, and GCP. It learns from your queries over time, making the results as relevant to your work as possible!

Santiago

171,340 views • 1 year ago

Apple just trained a 3D Gaussian head reconstruction model on 10,000+ subjects. Feed-forward. No test-time optimization. New identity in, reconstructed Gaussian head out. The UV-parameterized Gaussian representation decouples the number of Gaussians from the number and resolution of input images, making it practical to train with many high resolution views. And the heads are not just static either: text-conditioned identity generation, plus blendshape-driven latent animation across identities. We've been building in the 3D Gaussian Splatting space for a while. The gap between "research demo" and "works on real people at scale" is closing fast.

KIRI Engine - 3D Scanner App

12,181 views • 2 months ago

Google Nano Banana 🍌 is crazy good at static ads... But it only generates one image at a time. This n8n AI Agent helps you generate 1000s of winning ad variations in minutes, fully automated. → Built with the latest Nano Banana image model → Creates static ad images in bulk → Upload product reference image via n8n form → OpenAI Vision analyzes your product automatically → AI Agent generates custom image prompts (you choose how many) → Nano Banana creates static ad images on demand → Images auto-stored in Box.com for instant access You can request 50, 100, or even 1000 ad variations with one upload. Just specify the number in the form → AI does everything else. Built 100% in n8n. Zero manual work after setup. Want access to the template? → Like this post → Comment "ADS" And I'll send it right over.

Mike Futia

209,489 views • 10 months ago

Throughout my journey in developing multimodal models, I’ve always wanted a framework that lets me plug & play modality encoders/decoders on top of an auto-regressive LLM. I want to prototype fast, try new architectures, and have my demo files scale effortlessly, with full support for parallelism and optimization. Not just to hack⚙️, but also to scale🚀. So finally we built it for ourselves. LMMs-Engine: a lean, efficient framework built to train unified multimodal models at scale. From Qwen LLM, VLM, LLaVA-OV, and WanVideo, to unified models like Qwen-Omni and BAGEL, plus Linear-Attn GDN and research prototypes like RAE and SiT: all under one modular system that seamlessly integrates diverse datasets and optimization strategies. Powered by FSDP2 multi-dim parallelism, Ulysses sequence parallel, Flash-Attention, Liger Kernels, and Native Sparse Attention (also with bonus support for the Muon optimizer for all models).

Brian Li

54,822 views • 9 months ago

Introducing Kaleido💮 from AI at Meta: a universal generative neural rendering engine for photorealistic, unified object and scene view synthesis. Kaleido is built on a simple but powerful design philosophy: 3D perception is a form of visual common sense. Following this idea, we formulate rendering purely as a sequence-to-sequence generation problem, successfully unifying neural rendering with the architecture principles behind modern language and video models. Unlike traditional neural rendering methods, Kaleido learns 3D purely in a data-driven way, without explicit 3D representations or structures. It acquires spatial understanding directly through large-scale video pretraining, then multi-view 3D data finetuning, inspired by how LLMs acquire textual common sense from large corpora before specialising in domains like coding. Through extensive ablations, we progressively modernised the architecture design and training strategies and tackled key scaling challenges in sequence-to-sequence generative rendering, arriving at a design that’s simple, versatile, and scalable. Kaleido significantly outperforms prior generative models in few-view settings, and remarkably is the first zero-shot generative method that matches InstantNGP-level rendering quality in multi-view settings. We view Kaleido also as an alternative step towards world modeling that flexibly spans a spectrum of “realities”: with many views, it faithfully reconstructs grounded reality; with fewer views, it imagines plausible unseen details. 🔗 Explore more results and paper:

Shikun Liu

22,332 views • 9 months ago

Wonderland: Navigating 3D Scenes from a Single Image
Contributions:
• First, we introduce a representation for controllable 3D generation by leveraging the generative priors from camera-guided video diffusion models. Unlike image models, video diffusion models are trained on extensive video datasets. This enables them to capture comprehensive spatial relationships within scenes across multiple views and embed a form of "3D awareness" in their latent space, which allows us to maintain 3D consistency in novel view synthesis.
• Second, to achieve controllable novel view generation, we empower video models with precise control over specified camera motions. We introduce a novel dual-branch conditioning mechanism that effectively incorporates desired diverse camera trajectories into the video diffusion model. This enables expansion of a single image into a multi-view consistent capture of a 3D scene with precise pose control.
• Third, to achieve efficient 3D reconstruction, we directly transform video latents into 3DGS. We propose a novel latent-based large reconstruction model (LaLRM) that lifts video latents to 3D in a feed-forward manner. With this design, during inference, our model directly predicts 3DGS from a single input image, effectively aligning the generation and reconstruction tasks, and bridging image space and 3D space, through the video latent space. Compared with reconstructing scenes from images, the video latent space offers a 256× spatial-temporal reduction while retaining essential and consistent 3D structural details. Such a high degree of compression is crucial, as it allows the LaLRM to handle a wider range of 3D scenes within the reconstruction framework, with the same memory constraints.

MrNeRF

52,801 views • 1 year ago

There's a very peculiar thing that happens when I ask people what their Rich Life is: they start minimizing what they actually want. "I guess one day I'd like a beach house...it doesn't have to be big. It doesn't even have to be that close to the beach!" "I'd like to travel. Nothing fancy. Doesn't even have to be for that long. Maybe just a chance to spend a couple days in Italy." Look at this clip from today's podcast episode. A multi-millionaire to-be is telling me how her massage "doesn't even have to be"... I stopped her right there. WHY DO WE SHRINK OUR DREAMS? Even in our description of a Rich Life, we apologize and play small. Most people who do this aren't even aware of it. It's unconscious. Sometimes they grew up in the Midwest, sometimes religious, sometimes they're simply afraid to say they want something big, because if they don't get it...then they failed themselves. But if you want to live a Rich Life, you're much more likely to get there by setting a big, audacious vision. Playing small feels safe, but very few people get motivated by a vision like "I just want to be debt free." Dream bigger, be specific, and use your money to get there.

Ramit Sethi

26,480 views • 2 months ago