Video diffusion models have strong implicit representations of 3D shape, material, and lighting, but controlling them with language is cumbersome, and control is critical for artists and animators. GenLit connects these implicit representations with a continuous 5D control signal describing the direction and intensity of a point light source.... This enables single-image near-field relighting using a video diffusion model. We use a ControlNet-like approach and show that, with a small amount of synthetic data, GenLit generalizes to complex real-world images. Given a single image and the 5D lighting signal, GenLit creates a video of a moving light source that is inside the scene. It moves around and behind scene objects, producing effects such as shading, cast shadows, specularities, and interreflections with a realism that is hard to obtain with traditional inverse rendering methods. GenLit shows that it is possible to get continuous control over implicit physical processes within a video model. I think this is just the beginning and promises to make such models much more practical for creators. Shrisha Bharadwaj will present today at SIGGRAPH Asia, Room S423/S424, Level 4, at 13:50 on 15 Dec.

Michael Black

98,600 subscribers

22,089 views • 6 months ago • via X (Twitter)

Related videos

The Hidden Language of Diffusion Models paper page: we tackle the challenge of understanding concept representations in text-to-image models by decomposing an input text prompt into a small set of interpretable elements. This is achieved by learning a pseudo-token that is a sparse weighted combination of tokens from the model's vocabulary, with the objective of reconstructing the images generated for the given concept. Applied over the state-of-the-art Stable Diffusion model, this decomposition reveals non-trivial and surprising structures in the representations of concepts. For example, we find that some concepts such as "a president" or "a composer" are dominated by specific instances (e.g., "Obama", "Biden") and their interpolations. Other concepts, such as "happiness", combine associated terms that can be concrete ("family", "laughter") or abstract ("friendship", "emotion"). In addition to peering into the inner workings of Stable Diffusion, our method also enables applications such as single-image decomposition to tokens, bias detection and mitigation, and semantic image manipulation.

AK

41,746 views • 3 years ago

Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering discuss: The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene's lighting, geometry and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently "understand" the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic materials and tone-mapping refinement.

AK

19,101 views • 1 year ago

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
TL;DR: Create 3/4DGS from Video Diffusion
Note: Some first inference code released (not all yet).
Contributions (cited):
• We present DimensionX, a novel framework for generating photorealistic 3D and 4D scenes from only a single image using controllable video diffusion.
• We propose ST-Director, which decouples the spatial and temporal priors in video diffusion models by learning (spatial and temporal) dimension-aware modules with our curated datasets. We further enhance the hybrid-dimension control with a training-free composition approach according to the essence of video diffusion denoising process.
• To bridge the gap between video diffusion and real-world scenes, we design a trajectory-aware mechanism for 3D generation and an identity-preserving denoising approach for 4D generation, enabling more realistic and controllable scene synthesis.
• Extensive experiments manifest that our DimensionX delivers superior performance in video, 3D, and 4D generation compared with baseline methods.

MrNeRF

17,017 views • 1 year ago

Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.

Alec Helbling

94,065 views • 6 months ago
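
The dot-product idea in Alec Helbling's post above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the post's actual code: the function name `saliency_map`, the token shapes, and the toy embeddings are all hypothetical, and a real DiT would supply the image-patch and text-token representations from one of its layers.

```python
import numpy as np

def saliency_map(image_tokens, text_token, grid_hw):
    """Dot each image-patch embedding with one text-token embedding.

    image_tokens: (N, D) patch embeddings (hypothetical; taken from a DiT layer in practice)
    text_token:   (D,)   embedding of the concept word
    grid_hw:      (H, W) patch grid with H * W == N
    """
    h, w = grid_hw
    scores = image_tokens @ text_token           # (N,) one similarity score per patch
    scores = scores.reshape(h, w)                # lay the scores out on the patch grid
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-8)      # min-max normalise to [0, 1] for display

# Toy data: 16 patches of dimension 8, with patch 5 aligned to the text token.
rng = np.random.default_rng(0)
text_token = np.ones(8) / np.sqrt(8)
image_tokens = 0.01 * rng.normal(size=(16, 8))
image_tokens[5] += text_token

heat = saliency_map(image_tokens, text_token, (4, 4))
peak = np.unravel_index(heat.argmax(), heat.shape)  # grid cell with the highest saliency
```

In this toy setup the hottest cell is the one whose embedding was aligned with the text token; with real model activations the map would be upsampled to the image resolution before overlaying it.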

✨ Made a new mini feature on Photo AI: [ Grab from 3d model ]. So the problem is we're at that stage in time (typical for AI) where image-to-3d models are not good enough but are fun to play with, but we know they'll be good enough in 1-2 years. With [ Make 3d model ] you already can turn any Photo AI pic into a 3d model but it still looks hyper clunky and deformed, but it works! One cool idea I had to make that more useful and made now: let people make a 3d model, then change the view of it with the 3d viewer, then press [ o ] and it grabs a frame of the 3d. That image you can then [ Remix ] (img2img), and it becomes a real photo again, and that in turn you can then turn into a video again with [ Make video ]. So that essentially gives you a fully freeform camera position control to take photos with. One thing I need to fix is the background/skybox, I kinda need to take the original photo and remove the person and just get the background for the 3d model viewer, in this case it should be white, but it's a start!

@levelsio

119,210 views • 11 months ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,760 views • 2 years ago

the fact that i can take an image of a room and turn it into a 3d model in one shot is actually insane this took like 30 seconds from image to 3d model

Jan

131,884 views • 6 months ago

🚀 Introducing GenLit – Reformulating Single-Image Relighting as Video Generation! We leverage video diffusion models to perform realistic near-field relighting from just a single image—No explicit 3D reconstruction or ray tracing required! No intermediate graphics buffers, directly in the pixel space! 📄 Dive into the paper: 🎥 Project page & demos: 🛠 Code coming soon! #GenerativeAI #ComputerVision #Relighting #DiffusionModels #Graphics 🧵 1/5

Haven Feng @ CVPR

22,427 views • 1 year ago

🇨🇳 Another great Chinese Model, OmniHuman-1.5 from ByteDance. Turns 1 image plus a voice track into expressive avatar video by pairing a System 1 and System 2 inspired planner with a Diffusion Transformer. Produces coherent motion for over 1 minute with moving camera and multi character scenes. Most avatar models move to the beat of the audio but miss meaning, so gestures feel generic and emotions feel shallow. The fix here is a Multimodal LLM planner that listens to the speech and drafts a structured plan describing intent, emotions, beats, and high level actions, which gives the motion engine clear semantic targets instead of only rhythm. The motion engine is a Multimodal Diffusion Transformer that fuses the plan with audio, the single reference image, and optional text prompts, then synthesizes continuous body, face, and head motion that matches both words and tone. A key trick is a Pseudo Last Frame, a synthetic target that summarizes the next expected state, which stabilizes fusion across modalities and keeps motion consistent over long spans. From just 1 image and speech, the system outputs speaking avatars with synchronized lips, context aware gestures, and continuous camera movement, and it also supports multi character interactions without manual choreography. Reported results show strong lip sync accuracy, high video quality, natural motion, and close match to text prompts, and the same setup works on nonhuman characters too.

Rohan Paul

63,859 views • 9 months ago

Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation paper page: Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to video domain, ensuring temporal consistency across video frames remains a formidable challenge. This paper proposes a novel zero-shot text-guided video-to-video translation framework to adapt image models to videos. The framework includes two parts: key frame translation and full video translation. The first part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors. The second part propagates the key frames to other frames with temporal-aware patch matching and frame blending. Our framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet. Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.

AK

375,080 views • 3 years ago

This AI model turns a single image + audio into a talking avatar video of infinite length. StableAvatar is built on Wan 2.1, and MIT Licensed. Open-source AI video is accelerating, and it’s exciting to watch!

Miguel | AP

35,804 views • 10 months ago

From product image to video with just one tool: Dzine. As you may have noticed, this is one of my favorite tools. It is also very underrated, as probably 50% of my tutorials include some workflow. I was testing the new image-to-video option today, and I love it. Step-by-step guide in comments 🔽 I can do 95% of a workflow without switching between apps. Image generation, Image to image with style reference, background removal, background generation, and 2 frames image to video. The only other app I have been using for this video is CapCut so that I can stitch it together. Step by step 🔽

Teodora P L

28,523 views • 1 year ago

Placing objects sounds simple… until robots have to do it. This method makes it simple, fast & reliable. [Github ⬇️] Robotic object placement is tough, especially with stacking, hanging, or insertion. AnyPlace is a new two-stage method that uses only synthetic data and a vision-language model to teach robots where and how to place objects; even in the real world. Why this works ✅ Finds the right spot with help from vision-language models ✅ Handles stacking, insertion, and hanging with no real-world training ✅ Trained on synthetic data using Blender and IsaacSim ✅ Works in the real world without fine-tuning It shows that smart use of simulation and language models can make robotic placement tasks easier, faster, and more reliable. Github: Paper: Thank you for sharing Animesh Garg !

Ilir Aliu - eu/acc

22,843 views • 1 year ago

Combining the explicit control of 3D software with the creativity of generative AI models is a promising yet underrated workflow. Build your 3D scenes procedurally by describing them in natural language, then take them all the way with your image & video models of choice. Tools like intangible are built around such a workflow so you don't need to duct-tape apps together. Pretty cool!

Bilawal Sidhu

37,544 views • 11 months ago

Code of #ACEZero is out. A new approach to SfM. Learn the 3D scene without image-to-image matching. Naturally avoids the explosion of complexity for many images. ACE0 shines if you have dense coverage of a scene. Posing 10k images and more? Sure!

Eric Brachmann

23,988 views • 1 year ago

World modeling and imitation learning have largely been considered two disparate worlds. In our recent work, Unified World Models, just accepted to #RSS2025, Chuning Zhu provides a dead-simple unifying solution: just train a joint diffusion model over actions and future states, but with *decoupled* diffusion time steps across these modalities. Manipulating these decoupled time steps then allows for marginalization or conditioning on actions or states; a single model can serve as a policy, forward dynamics model, video prediction model, or inverse dynamics model by simply setting diffusion timesteps carefully. The resulting model can leverage video datasets along with robot training data much more effectively, and shows improved robustness, generalization, and flexibility. This is exciting because it is frustratingly simple, scalable, and shows strong improvement on real-world robotics problems. Please refer to Chuning Zhu's excellent thread for more details! More details/code can be found on our website and in the paper -

Abhishek Gupta

11,388 views • 1 year ago
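
One way to read the "decoupled diffusion time steps" in the Unified World Models post above is as a small lookup from the desired role to a pair of timesteps. The sketch below is an assumption-laden illustration, not the authors' code: the step count `T`, the function name `decoupled_timesteps`, and the exact mapping are hypothetical, chosen so that a fully noised modality (timestep `T`) is marginalized out, a clean modality (timestep `0`) is conditioned on, and the remaining modality is denoised at the current step `t`.

```python
# Hypothetical total number of diffusion steps: T means "pure noise", 0 means "clean".
T = 1000

def decoupled_timesteps(mode: str, t: int) -> tuple[int, int]:
    """Return (t_action, t_state) for the current denoising step t.

    Assumed convention (not taken from the paper's code):
      timestep T -> modality is pure noise, i.e. marginalized out
      timestep 0 -> modality is clean, i.e. conditioned on
      timestep t -> modality is the one currently being denoised
    """
    if not 0 <= t <= T:
        raise ValueError("t must lie in [0, T]")
    schedule = {
        "policy": (t, T),            # denoise actions, ignore future states
        "forward_dynamics": (0, t),  # condition on actions, denoise future states
        "video_prediction": (T, t),  # ignore actions, denoise future states
        "inverse_dynamics": (t, 0),  # condition on future states, denoise actions
    }
    return schedule[mode]

# Example: at step 500 the policy denoises actions while future states stay fully noised.
policy_steps = decoupled_timesteps("policy", 500)
```

The same network would be called with whichever pair the chosen role yields, which is what lets one model act in all four roles.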

✨ Every time the video models get better, the try on model on Photo AI also becomes a lot more useful, as a large % of my customers now are e-commerce stores. And showing clothes in a video is nice for sales! With AI this means stores don't need to do expensive shoots flying a model and entire camera and light crew around the world. They can just upload a few photos of their models, then upload the clothes, and describe the setting (like a beach in Thailand) and in less than 10 seconds it's generated, for a video in less than a minute! Below is the input: a dress laid flat, and output: a full video shoot

@levelsio

331,475 views • 1 year ago

This is a 🔥 launch for AI image generation! I uploaded four photos of myself to KREA AI and got a model that generates real-time images of me. Adjust the prompt and you'll immediately see a change - watch me travel around the world ✨

Justine Moore

76,200 views • 1 year ago

The one thing we absolutely know with certainty about the Richat Structure is that there was an incredible amount of human activity here from the dawn of toolmaking and tool use itself. The Acheulean Hand Axe is the second tool that humans ever made and this technology spread throughout Africa, Europe and Asia while no definitive evidence exists that this technology ever made it to the Americas. I think this is evidence that this technology was shared and disseminated intentionally, that neighboring humans didn’t come up with this independently on their own but this was a legacy, an institution, a skill and an industry that was shared and taught. It is perhaps the earliest evidence we have of widespread cultural diffusion throughout Africa, Europe and Asia. And yes, it’s right there at the Richat Structure which leads me to question the role that the Richat played to humans half a million years ago. It seems like that’s where we have to start with the Richat.

Archaic Lens

35,584 views • 7 months ago

So this is the dream: A video world model that takes an image as input and renders an environment you can explore and interact with. It could be a constant video stream - like your own lofi girl! Or you could jump in and “play” as a character.

Justine Moore

95,440 views • 10 months ago