Excited to share "MultiDiffusion"! A controlled image generation framework w/ a pre-trained text-to-image diffusion model.
* Spatial guidance controls (bounding boxes/masks)
* Arbitrary aspect ratios (huge panoramas!)
NO training, NO finetuning. [1/3]
Lior Yariv, Yaron Lipman, Tali Dekel

88,845 views • 3 years ago • via X (Twitter)

10 comments

Omer Bar Tal • 3 years ago

Our key idea is to define a new generation process, based on an optimization task that binds together multiple diffusion paths. The optimal solution has a closed form and can be found analytically, without computational overhead. [2/3]
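A minimal sketch of what "binding together multiple diffusion paths" can look like, not the authors' code. It assumes uniform weights and square sliding windows. The names `window_starts`, `fused_step`, and `denoise_window` are hypothetical, and `denoise_window` stands in for one denoising step of a pre-trained model. Under these assumptions, the least-squares reconciliation of overlapping window outputs is a per-pixel average, so each step costs only the window evaluations plus a sum and a divide.

```python
import numpy as np


def window_starts(length, window, stride):
    """Start offsets of sliding windows that together cover [0, length)."""
    if length < window:
        raise ValueError("canvas is smaller than the window")
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # make sure the far edge is covered
    return starts


def fused_step(canvas, denoise_window, window=64, stride=32):
    """Run denoise_window on every crop, then average the results per pixel.

    With uniform weights, this average minimizes the summed squared
    difference between each crop of the output and that window's prediction.
    """
    h, w = canvas.shape[:2]
    value_sum = np.zeros(canvas.shape, dtype=np.float64)
    count = np.zeros(canvas.shape, dtype=np.float64)
    for top in window_starts(h, window, stride):
        for left in window_starts(w, window, stride):
            rows = slice(top, top + window)
            cols = slice(left, left + window)
            value_sum[rows, cols] += denoise_window(canvas[rows, cols])
            count[rows, cols] += 1.0
    return value_sum / count  # every pixel is covered at least once


# Usage: a wide "panorama" canvas and a placeholder step (assumption: a real
# model call would go here instead of this simple scaling).
rng = np.random.default_rng(0)
panorama = rng.normal(size=(64, 160, 4))
fused = fused_step(panorama, lambda crop: 0.5 * crop)
```

Replacing the uniform count with per-window masks would be one way to express region-based controls, though that extension is not shown here.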

Omer Bar Tal • 3 years ago

Visit our project webpage for more details, results, and code 🥳 Arxiv: [3/3]

Omer Bar Tal • 3 years ago

MultiDiffusion is now integrated into diffusers 🚀 Currently text2panorama is supported; spatial controls (masks/bounding boxes) are coming soon :) demo: official repo: Thanks @RisingSayak @_akhaliq and the @huggingface team!

Hila Chefer • 3 years ago

@YarivLior @lipmanya @talidekel Very cool work! Congrats @omerbartal 🎊

Omer Bar Tal • 3 years ago

@YarivLior @lipmanya @talidekel Thanks @hila_chefer :)

Sebastian Bugge Loeschcke • 3 years ago

@YarivLior @lipmanya @talidekel Super cool work @omerbartal!

Lucas Beyer (bl16) • 3 years ago

@YarivLior @lipmanya @talidekel Super cool, and nice demo! I think you have a typo in the gif: a tree trunk, not a tree truck, though the latter would also be fun to see =)

Omer Bar Tal • 3 years ago

@YarivLior @lipmanya @talidekel Thanks! Ohh definitely a typo, but a cool idea to try ;)

Richard Löwenström • 3 years ago

@YarivLior @lipmanya @talidekel Nice background trick! I think I've seen the merging of predictions before, though not so nicely mathematically motivated. I think there's a PR to diffusers for x4 upscaling that does something similar, for example.

Richard Löwenström • 3 years ago

@YarivLior @lipmanya @talidekel Here's the paper I was thinking about but I may have misunderstood the math 🙏
