Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source. This enables near-field relighting of a single image using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source inside the scene. The light moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia, Room S423/S424, Level 4, at 13:50 on 15 Dec.

Michael Black
22,004 views • 5 months ago
The Hidden Language of Diffusion Models paper page: tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness", combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation.

AK
41,746 views • 3 years ago
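
The pseudo-token described in the post above is a sparse weighted combination of vocabulary token embeddings. A minimal sketch of that one step follows. It assumes the weights have already been learned, and it uses top-k selection to enforce sparsity, which is an illustrative choice and not necessarily the paper's mechanism. The function name `sparse_pseudo_token` and the toy arrays are hypothetical.

```python
import numpy as np

def sparse_pseudo_token(vocab_embeddings, weights, k):
    """Combine vocabulary embeddings using only the k largest weights.

    vocab_embeddings: (V, d) array, one embedding per vocabulary token.
    weights:          (V,) array of learned coefficients (assumed given).
    k:                number of tokens kept (assumed sparsity rule).
    Returns the (d,) pseudo-token and the indices of the kept tokens.
    """
    weights = np.asarray(weights, dtype=float)
    kept = np.argsort(weights)[-k:]      # indices of the k largest weights
    sparse = np.zeros_like(weights)
    sparse[kept] = weights[kept]         # every other coefficient is zero
    pseudo_token = sparse @ np.asarray(vocab_embeddings, dtype=float)
    return pseudo_token, np.sort(kept)

# Toy vocabulary of 4 tokens with 4-dimensional one-hot embeddings.
embeddings = np.eye(4)
token, kept = sparse_pseudo_token(embeddings, [0.1, 0.7, 0.05, 0.9], k=2)
```

The kept indices are the interpretable part: they name the few vocabulary tokens that make up the concept.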
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

AK
19,101 views • 1 year ago
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
TL;DR: Create 3/4DGS from Video Diffusion
Note: Some first inference code released (not all yet).
Contributions (cited):
• We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion.
• We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process.
• To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis.
• Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF
17,017 views • 1 year ago
Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.

Alec Helbling
94,065 views • 5 months ago
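
The dot-product idea in the post above can be sketched in a few lines. This is an illustration only: it assumes you have already extracted one embedding per image patch and one embedding for a text token from some layer of a DiT, and it does not use any real Flux API. The function name `saliency_map`, the array shapes, and the min-max normalisation are all assumptions.

```python
import numpy as np

def saliency_map(image_tokens, text_token, grid_hw):
    """Score each image patch against one text token and lay the scores on a grid.

    image_tokens: (h*w, d) array of patch embeddings (assumed pre-extracted).
    text_token:   (d,) embedding of the concept word (assumed pre-extracted).
    grid_hw:      (h, w) layout of the patches in the image.
    Returns an (h, w) map scaled to [0, 1].
    """
    h, w = grid_hw
    scores = (image_tokens @ text_token).reshape(h, w)  # one dot product per patch
    lo, hi = scores.min(), scores.max()
    if hi == lo:                                        # flat map: avoid dividing by zero
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

# Toy input: 16 patches on a 4x4 grid, only patch 5 matches the text token.
patches = np.zeros((16, 8))
concept = np.ones(8)
patches[5] = concept
heat = saliency_map(patches, concept, (4, 4))
```

In practice the map would be upsampled to the image resolution and overlaid on the picture. Which layer gives the most useful embeddings is an empirical question that this sketch does not address.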
✨ Made a new mini feature on Photo AI: [ Grab from 3d model ] So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: Let people make a 3d model then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d That image you can then [ Remix ] (img2img), and it becomes a real photo again and that in turn you can then turn into a video again with [ Make video ] So that essentially gives you a fully freeform camera position control to take photos with One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio
119,210 views • 11 months ago
Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images, whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu
92,760 views • 2 years ago
🚀 Introducing GenLit: Reformulating Single-Image Relighting as Video Generation! We leverage video diffusion models to perform realistic near-field relighting from just a single image. No explicit 3D reconstruction or ray tracing required! No intermediate graphics buffers, directly in the pixel space! 📄 Dive into the paper: 🎥 Project page & demos: 🛠 Code coming soon! #GenerativeAI #ComputerVision #Relighting #DiffusionModels #Graphics 🧵 1/5

Haiwen (Haven) Feng
22,392 views • 11 months ago
the fact that i can take an image of a room and turn it into a 3d model in one shot is actually insane this took like 30 seconds from image to 3d model

Jan
131,884 views • 6 months ago
🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance. Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer. Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul
63,859 views • 9 months ago
Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK
375,077 views • 3 years ago
This AI model turns a single image + audio into a talking avatar video of infinite length. StableAvatar is built on Wan 2.1, and MIT Licensed. Open-source AI video is accelerating, and it’s exciting to watch!

Miguel | AP
35,804 views • 9 months ago
From product image to video with just one tool - Dzine As you may have noticed, this is one of my favorite tools. It is also very underrated, as probably 50% of my tutorials include some workflow. I was testing the new image-to-video option today, and I love it. Step-by-step guide in comments 🔽 I can do 95% of a workflow without switching between apps. Image generation, image-to-image with style reference, background removal, background generation, and 2-frame image-to-video. The only other app I have been using for this video is CapCut so that I can stitch it together. Step by step 🔽

Teodora P L
28,523 views • 1 year ago
Placing objects sounds simple… until robots have to do it. This method makes it simple, fast & reliable. [Github ⬇️] Robotic object placement is tough, especially with stacking, hanging, or insertion. AnyPlace is a new two-stage method that uses only synthetic data and a vision-language model to teach robots where and how to place objects, even in the real world. Why this works ✅ Finds the right spot with help from vision-language models ✅ Handles stacking, insertion, and hanging with no real-world training ✅ Trained on synthetic data using Blender and IsaacSim ✅ Works in the real world without fine-tuning It shows that smart use of simulation and language models can make robotic placement tasks easier, faster, and more reliable. Github: Paper: Thank you for sharing Animesh Garg!

Ilir Aliu - eu/acc
22,843 views • 1 year ago
Combining the explicit control of 3D software with the creativity of generative AI models is a promising yet underrated workflow. Build your 3D scenes procedurally by describing them in natural language, then take them all the way with your image & video models of choice. Tools like intangible are built around such a workflow so you don't need to duct-tape apps together. Pretty cool!

Bilawal Sidhu
37,544 views • 10 months ago
Code of #ACEZero is out. A new approach to SfM. Learn the 3D scene without image-to-image matching. Naturally avoids the explosion of complexity for many images. ACE0 shines if you have dense coverage of a scene. Posing 10k images and more? Sure!

Eric Brachmann
23,950 views • 1 year ago
World modeling and imitation learning have largely been considered two disparate worlds. In our recent work, Unified World Models, just accepted to #RSS2025, Chuning Zhu provides a dead-simple unifying solution: just train a joint diffusion model over actions and future states, but with *decoupled* diffusion time steps across these modalities. Manipulating these decoupled time steps then allows for marginalization or conditioning on actions or states; a single model can serve as a policy, forward dynamics model, video prediction model, or inverse dynamics model by simply setting diffusion timesteps carefully. The resulting model can leverage video datasets along with robot training data much more effectively, and shows improved robustness, generalization, and flexibility. This is exciting because it is frustratingly simple, scalable, and shows strong improvement on real-world robotics problems. Please refer to Chuning Zhu's excellent thread for more details! More details/code can be found on our website and in the paper.

Abhishek Gupta
11,388 views • 1 year ago
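
The post above says a single model switches roles "by simply setting diffusion timesteps carefully". One way to picture that is a lookup from role to a pair of timesteps. The sketch below rests on an assumed convention that the post does not state: timestep 0 means the modality is supplied clean (conditioning), and the maximum timestep means it is pure noise (marginalized out). The constant `T_MAX` and the function `decoupled_timesteps` are hypothetical names, not the authors' code.

```python
from typing import Tuple

T_MAX = 1000  # hypothetical number of diffusion steps

def decoupled_timesteps(mode: str, t: int, t_max: int = T_MAX) -> Tuple[int, int]:
    """Return (action_timestep, future_state_timestep) for one denoising step.

    t is the current timestep of whichever modality is being generated.
    Assumed convention: 0 = clean input (condition on it),
    t_max = pure noise (marginalize it out).
    """
    if not 0 <= t <= t_max:
        raise ValueError(f"t must be in [0, {t_max}], got {t}")
    table = {
        "policy":           (t, t_max),  # generate actions, ignore future states
        "video_prediction": (t_max, t),  # generate future states, ignore actions
        "forward_dynamics": (0, t),      # generate future states given clean actions
        "inverse_dynamics": (t, 0),      # generate actions given clean future states
    }
    if mode not in table:
        raise ValueError(f"unknown mode: {mode!r}")
    return table[mode]

# Halfway through sampling, a policy denoises actions while states stay fully noised.
action_t, state_t = decoupled_timesteps("policy", 500)
```

The table makes the design choice visible: no extra heads or retraining are needed to change roles, only a different pair of timesteps fed to the same network.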
✨ Every time the video models get better, the try-on model on Photo AI also becomes a lot more useful, as a large % of my customers now are e-commerce stores. And showing clothes in a video is nice for sales! With AI this means stores don't need to do expensive shoots flying a model and entire camera and light crew around the world They can just upload a few photos of their models, then upload the clothes, and describe the setting (like a beach in Thailand) and in less than 10 seconds it's generated, for a video in less than a minute! Below is the input: a dress laid flat, and output: a full video shoot

@levelsio
331,475 views • 1 year ago
This is a 🔥 launch for AI image generation! I uploaded four photos of myself to KREA AI and got a model that generates real-time images of me. Adjust the prompt and you'll immediately see a change - watch me travel around the world ✨

Justine Moore
76,200 views • 1 year ago
🚨 JUST IN: THIS FREE TOOL JUST REPLACED FOUR AI IMAGE AND VIDEO SUBSCRIPTIONS AT ONCE. Midjourney. Krea. Higgsfield. Openart. One repo. 200+ models. Zero dollars a month. Here is what it actually does. It is a full image and video studio that runs in your browser or as a desktop app. Text to image, image to image, text to video, image to video, lip sync, cinema mode with real camera controls. All of it. 4,500 people already starred this. What you get for free: → 50+ image models including Flux, Midjourney v7, Ideogram, GPT-4o, Seedream → 60+ video models including Kling, Sora, Veo, Runway, Wan, Hailuo → lip sync studio with 9 dedicated models. upload a portrait and audio and it talks → cinema studio with real camera controls. lens, focal length, aperture, film stock → feed up to 14 reference images into one generation → self-hosted. your data never leaves your machine The crazy part is there is also a hosted version that needs zero setup. Just open the link and start generating. Now the math. Midjourney Standard: $30/month Krea AI Pro: $30/month Higgsfield Plus: $49/month Openart AI: $15/month That is $124 a month. $1,488 a year. This repo does everything all four do. With more models than any of them. For free. Forever. No subscription. No vendor lock-in. MIT licensed. Download it in one click on Mac or Windows. Someone should have told me about this sooner. I feel like an idiot. ( save this )

Kanika
14,656 views • 1 month ago
You can't 3D reconstruct glass from images... ...WRONG! Thanks to video diffusion, now just about anything is possible! Introducing...Diffusion Knows Transparency (DKT) Transparent and reflective objects usually break robot vision and photogrammetry pipelines because they don't follow the "solid object" rules standard cameras expect. DKT is a new AI model that repurposes the "internal physics engine" found in video generation models to solve this problem. Researchers took a massive video diffusion model (WAN) and fine-tuned it using a custom-built synthetic dataset to turn it into a high-precision depth sensor. To train the AI, they built the first massive synthetic video library of transparent objects, 1.32 million frames of perfectly labeled glass and metal objects in motion. Without ever seeing a "real" labeled video of glass during training, the model (DKT) outperformed all previous specialized systems on real-world benchmarks (ClearPose, DREDS). They created a "lightweight" 1.3B parameter version that runs fast enough (0.17s per frame) to be used on actual robot hardware. Two reasons I find this project important: 1. It further proves that synthetic data will be essential for training the next generation of vision models. 2. In real-world robotic tests, using DKT's depth maps nearly doubled the success rate of robot arms trying to pick up objects on tricky reflective or translucent surfaces. At-home robots will need to interact with these types of objects on a daily basis. Check out the project page here: Code is LIVE! #Computervision #Robotics #AI

Jonathan Stephens
17,712 views • 5 months ago