Uploaded: 2025-12-15T01:59:07.000Z
Duration: PT13.153S
Channel: Michael Black

Video diffusion models have strong implicit representations of 3D... shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables single-image near-field relighting of an image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, secularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia Room: S423/S424, Level 4 @ 13:50 on 15 of Dec.show more

Michael Black

22,182 views • 7 months ago

The Hidden Language of Diffusion Models paper page: tackle... the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness" combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulationshow more

41,746 views • 3 years ago

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The... correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.show more

19,101 views • 1 year ago

DimensionX: Create Any 3D and 4D Scenes from a... Single Image with Controllable Video Diffusion TL;DR: Create 3/4DGS from Video Diffusion Note: Some first inference code released (not all yet). Contributions (cited): • We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion. • We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybriddimension control with a training-free composition approach according to the essence of video diffusion denoising process. • To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis. • Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.show more

MrNeRF

17,047 views • 1 year ago

✨ Made a new mini feature on Photo AI:... [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of the it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!show more

@levelsio

119,210 views • 1 year ago

Create a 3D model from a single image, set... of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”show more

Bilawal Sidhu

92,792 views • 2 years ago

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance Turns... 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer, Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.show more

Rohan Paul

63,859 views • 10 months ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page:... Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.show more

375,123 views • 3 years ago

AI Is Moving Beyond “Generating Videos” — Toward “Generating... Worlds” Over the past two years, AI video models have advanced at an astonishing pace. From Runway and Pika to Sora and Veo, AI-generated videos have become increasingly realistic and more consistent with the physical laws of the real world. Many people believe the next objective is simply to generate videos that are longer, sharper, and more lifelike. But if we take a step back, we can see that the real transformation is not happening in video itself. It is happening in world models. What Is a World Model? In 1943, psychologist Kenneth Craik proposed an idea that would influence artificial intelligence research for decades. He argued that the human brain does not merely react to the outside world. Instead, it maintains an internal model of how the world works. Because we have this internal model, we can predict the outcome of an action before we actually take it. Before crossing a road, we estimate whether a car will pass by. Before catching a ball, we predict its trajectory. These abilities come from continuously simulating the world in our minds, rather than relying entirely on trial and error. This idea later became known by a more formal term: World Model. A world model does not describe a single image or a fixed video clip. It is an internal representation capable of continuously simulating the rules and dynamics of the real world. Why Is AI Research Turning Toward World Models? Because predicting “what comes next” is becoming increasingly central to how AI systems work. Language models predict the next token. Image models predict the next step in the denoising process. Video models predict the next frame. A world model, however, attempts to predict something broader: What should the world look like in the next moment? In 2018, David Ha and Jürgen Schmidhuber proposed in their paper World Models that an intelligent agent could first learn a model of the world, and then use that internal model to plan its actions. The Dreamer series later demonstrated that many complex tasks could be learned by training agents inside an “imagined world.” At the same time, the development of video models such as Sora and Veo led researchers to another realization: A model capable of continuously generating video has already learned, at least implicitly, many of the rules governing the real world. As a result, these two research directions have gradually begun to converge. But Video Is Not Yet a World This is where the distinction is often misunderstood. For a world model to support meaningful real-time interaction, it must solve several critical problems. Most video models today are essentially answering one question: What should the next frame look like? A true world model needs to answer much more: What happens if I take one step forward? If I walk behind a building and then return, will the building still be there? If I suddenly change the camera angle, will the entire space remain consistent? If I enter a command such as: “Summon a dragon.” Will the world respond immediately? In other words, a world model must do more than generate content. It must understand space. It must understand time. It must understand causality. And it must understand interaction. Moving from watching to participating is where the real difficulty of world models begins. World Models Are Entering the Interactive Era One of the latest attempts in this direction is Alaya World, recently open-sourced by Alaya World, or Alaya Lab. Instead of generating a fixed video clip, it generates a world that users can explore in real time. Users can begin with text, an image, or a video, enter the generated scene, move freely through it, and introduce new prompts at any moment during generation. The world responds immediately. According to the publicly released information, Alaya World provides: Real-time streaming generation at 720p and 24 FPS Stable continuous exploration for more than one minute The ability to switch prompts and trigger skills or events during generation Model weights and inference code released under the Apache 2.0 License Training code and datasets planned for future release What makes these capabilities important is not simply the technical specifications. It is that the generated “world” can now support continuous interaction. The official demo shows that users can genuinely control, transform, and explore the generated environment. AI Is Evolving From a Tool Into an Environment Over the past few years, most discussions around AI have focused on content generation. Generating text. Generating images. Generating videos. But world models raise a fundamentally different question: Can AI generate an environment that people can inhabit, explore, and continuously evolve? If the answer is yes, the impact will extend far beyond video generation. Game development, robotics training, embodied intelligence, digital twins, virtual production, and many other fields could be transformed by the development of world models. World models are still at a very early stage. Yet from Craik’s proposal of an internal mental model more than eighty years ago to the emergence of today’s interactive world-generation systems, a clear evolutionary path is beginning to take shape. Perhaps what AI is ultimately learning has never been limited to images, videos, or language. Perhaps it is learning the world itself. References GitHub: Technical Report:show more

雪踏乌云

112,114 views • 6 days ago

From product image to video with just one tool... - Dzine As you may have noticed, this is one of my favorite tools. It is also very underrated, as probably 50% of my tutorials include some workflow. I was testing the new image-to-video option today, and I love it. Step - by step guide in comments 🔽 I can do 95% of a workflow without switching between apps. Image generation, Image to image with style reference, background removal, background generation, and 2 frames image to video. The only other app I have been using for this video is CapCut so that I can stitch it together. Step by step 🔽show more

Teodora P L

28,523 views • 1 year ago

World modeling and imitation learning have largely been considered... two disparate worlds. In our recent work, Unified World Models, just accepted to #RSS2025, Chuning Zhu provides a dead-simple unifying solution: just train a joint diffusion model over actions and future states, but with *decoupled* diffusion time steps across these modalities. Manipulating these decoupled time steps then allows for marginalization or conditioning on actions or states; a single model can serve as a policy, forward dynamics model, video prediction model, or inverse dynamics model by simply setting diffusion timesteps carefully. The resulting model can leverage video datasets along with robot training data much more effectively, and shows improved robustness, generalization, and flexibility. This is exciting because it is frustratingly simple, scalable, and shows strong improvement on real-world robotics problems. Please refer to Chuning Zhu 's excellent thread for more details! More details/code can be found on our website and in the paper -show more

Abhishek Gupta

11,430 views • 1 year ago

✨ Every time the video models get better, the... try on model on Photo AI also becomes a lot more useful, as a large % of my customers now are e-commerce store And showing clothes in a video is nice for sales! With AI this means stores don't need to do expensive shoots flying a model and entire camera and light crew around the world They can just upload a few photos of their models, then upload the clothes, and describe the setting (like a beach in Thailand) and in less than 10 seconds it's generated, for a video in less than a minute! Below is the input: a dress laid flat, and output: a full video shootshow more

@levelsio

334,170 views • 1 year ago

The one thing we absolutely know with certainty about... the Richat Structure is that there was an incredible amount of human activity here from the dawn of toolmaking and tool use itself. The Acheulean Hand Axe is the second tool that humans ever made and this technology spread throughout Africa, Europe and Asia while no definitive evidence exists that this technology ever made it to the Americas. I think this is evidence that this technology was shared and disseminated intentionally, that neighboring humans didn’t come up with this independently on their own but this was a legacy, an institution, a skill and an industry that was shared and taught. It is perhaps the earliest evidence we have of widespread cultural diffusion throughout Africa, Europe and Asia. And yes, it’s right there at the Richat Structure which leads me to question the role that the Richat played to humans half a million years ago. It seems like that’s where we have to start with the Richat.show more

Archaic Lens

35,584 views • 8 months ago

You can't 3D reconstruct glass from images... ...WRONG! Thanks... for video diffusion, now just about anything is possible! Introducing...Diffusion Knows Transparency (DKT) Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects, 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training the next generation vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AIshow more

Jonathan Stephens

17,712 views • 6 months ago

🚨 JUST IN: THIS FREE TOOL JUST REPLACED FOUR... AI IMAGE AND VIDEO SUBSCRIPTIONS AT ONCE. Midjourney. Krea. Higgsfield. Openart. One repo. 200+ models. Zero dollars a month. Here is what it actually does. It is a full image and video studio that runs in your browser or as a desktop app. Text to image, image to image, text to video, image to video, lip sync, cinema mode with real camera controls. All of it. 4,500 people already starred this. What you get for free: → 50+ image models including Flux, Midjourney v7, Ideogram, GPT-4o, Seedream → 60+ video models including Kling, Sora, Veo, Runway, Wan, Hailuo → lip sync studio with 9 dedicated models. upload a portrait and audio and it talks → cinema studio with real camera controls. lens, focal length, aperture, film stock → feed up to 14 reference images into one generation → self-hosted. your data never leaves your machine The crazy part is there is also a hosted version that needs zero setup. Just open the link and start generating. Now the math. Midjourney Standard: $30/month Krea AI Pro: $30/month Higgsfield Plus: $49/month Openart AI: $15/month That is $124 a month. $1,488 a year. This repo does everything all four do. With more models than any of them. For free. Forever. No subscription. No vendor lock-in. MIT licensed. Download it in one click on Mac or Windows. Someone should have told me about this sooner. I feel like an idiot. ( save this )show more

Kanika

14,749 views • 3 months ago

Break-A-Scene: Extracting Multiple Concepts from a Single Image introduce... the task of textual scene decomposition: given a single image of a scene that may contain several concepts, we aim to extract a distinct text token for each concept, enabling fine-grained control over the generated scenes. To this end, we propose augmenting the input image with masks that indicate the presence of target concepts. These masks can be provided by the user or generated automatically by a pre-trained segmentation model. We then present a novel two-phase customization process that optimizes a set of dedicated textual embeddings (handles), as well as the model weights, striking a delicate balance between accurately capturing the concepts and avoiding overfitting. We employ a masked diffusion loss to enable handles to generate their assigned concepts, complemented by a novel loss on cross-attention maps to prevent entanglement. We also introduce union-sampling, a training strategy aimed to improve the ability of combining multiple concepts in generated images. We use several automatic metrics to quantitatively compare our method against several baselines, and further affirm the results using a user study. Finally, we showcase several applications of our method paper page:show more

154,511 views • 3 years ago

Single video → a reframeable 4D Gaussian Splatting scene.... Not a sequence of separately built 3D frames played back like a video. This is one continuous space-time scene, reconstructed from a single clip shot on an iPhone 16. We combine feed-forward Gaussian generation, 3D tracking, and 4D Gaussian Splatting, aiming to deliver it as a compact 4D video file that runs on your phone. Still early R&D. The goal: make 4DGS something anyone can create and experience, not just researchers with a camera rig.show more

KIRI Engine - 3D Scanner App

29,478 views • 6 days ago

The next phase of AI video needs infrastructure, not... more tools.. This isn’t just a product update. It’s a new category. Topview 4.0 is the world’s first AI video collaboration board, built for how creators and teams actually work. One shared workspace to create, review, and iterate together without switching tools. Prompt → Image → Video → Avatar in a single flow. Real-time collaboration. All top image and video models in one place. Creator-friendly pricing, cheaper than official APIs. If AI video is part of your workflow, this is the board it belongs on. TopviewAIshow more

Kuria | AI

12,848 views • 5 months ago

Playing around with World Labs' 3D generator and combining... show more

enigmatic_e

19,774 views • 1 year ago

If you think OpenAI Sora is a creative toy... like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths. I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be! Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee." - The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space. - The 3D objects are consistently animated as they sail and avoid each other's paths. - Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations. - Photorealism, almost like rendering with raytracing. - The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe. - The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect. Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.show more

Jim Fan

6,182,281 views • 2 years ago

Live Cam