✨ 🖼️ Generate images with consistent characters without any fine-tuning or training. ConsiStory is a training-free approach to maintaining subject consistency across text-to-image generations from pretrained models. #NVIDIAResearch 🎨 Test it here:

NVIDIA AI Developer

111,700 subscribers

11,612 views • 1 year ago • via X (Twitter)

Art • Education • Science & Technology #NVIDIAResearch

2 comments

NVIDIA AI Developer • 1 year ago

📝 For more details on ConsiStory, check out the #NVIDIAResearch project page: ✨

tom • 1 year ago

Cute

Related videos

NVIDIA presents ConsiStory: Training-Free Consistent Text-to-Image Generation
Paper page:
Enables Stable Diffusion XL (SDXL) to generate consistent subjects across a series of images, without additional training.

AK

161,685 views • 2 years ago

Creativity unleashed 👨‍🎨🎨✨ 👀 Edify3D from #NVIDIAResearch explores a new way to generate text-to-#3D images. 📗 Project page: 📝 Paper: 🔊 sound on

NVIDIA AI Developer

177,497 views • 1 year ago

DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
DiffSplat is a generative framework that synthesizes 3D Gaussian Splats from text prompts and single-view images in ⚡️ 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

AK

38,416 views • 1 year ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

19,004 views • 4 months ago

🚀 Flux from Black Forest Labs is now on RenderNet! Create hyper-realistic, character-consistent images with just one reference image, no LoRAs required. Ready to bring your characters to life? Try it with just a few clicks! 🎨✨

Affogato AI

59,705 views • 1 year ago

👀 Pixel perfect 💎✨ 🖼️ Edify Image from #NVIDIAResearch is a family of diffusion models that supports a wide range of applications, including text-to-image synthesis, 4K upsampling, ControlNets, 360° HDR panorama generation, and finetuning for image customization. 🧵 1/2

NVIDIA AI Developer

14,747 views • 1 year ago

Sharing something exciting we've been working on as a Thanksgiving gift: Diffusion Self-Distillation (DSD), which redefines zero-shot customized image generation using FLUX. DSD is like DreamBooth, but zero-shot/training-free. It works across any input subject and desired context—character consistency, item/asset adaptation, scene relighting, and more. It even enables the creation of comics/mangas without any effort in fine-tuning or training a personalized model! 📰 Paper: 🌐 Website: Team effort with Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, and Gordon Wetzstein.

Prime (Shengqu) Cai

60,613 views • 1 year ago

🎉 Meet Krea 2 from Krea, an aesthetic open-source image model ranked #1 text-to-image from an independent lab on Artificial Analysis. Day-0 support is now live in SGLang! Krea 2 ships as two models built to work together: 1️⃣ RAW: undistilled base checkpoint: diverse & malleable, made for fine-tuning, post-training & LoRA 2️⃣ Turbo: 8-step distilled checkpoint: fast, high-quality text-to-image Train LoRAs on RAW, run them on Turbo: base for training, Turbo for fast inference on your own hardware Run it now with SGLang!

LMSYS Org

11,216 views • 1 month ago

Wan2.6: Commercial-Grade Image Output
Wan2.6-Image is now available.
🖼️ Interleaved Text-and-Image Output: Generate interleaved text-and-image content with logical reasoning capabilities, enabling layered, narrative-driven visual storytelling.
🖼️ Multi-Image Conditioned Generation: Support flexible referencing, combining, and replacement of multiple images, integrating varied visual inspirations to generate novel and compelling results.
🖼️ Commercial-Grade ID Preservation: High consistency in characters, styles, and elements for commercial scenarios.
🖼️ Extract creative elements, such as color, style, and composition, from reference images to enable aesthetically driven image generation.
🖼️ Precise Control of Camera Angles and Lighting: Support specifying camera perspective, spatial depth (foreground/background), and lighting details, enabling precise control over spatial composition and atmospheric mood.
Try Wan2.6 today.

Wan

33,277 views • 7 months ago

We’re excited to announce the release and open-sourcing of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. Its output quality is comparable to the industry’s flagship closed-source models. 🚀🚀🚀
HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities:
✅ Reason with world knowledge
✅ Understand complex, thousand-word prompts
✅ Generate precise text within images
Unlike image generation models built on the traditional DiT architecture, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training in a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks.
Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation; future updates will include image-to-image, image editing, multi-turn interaction, and more.
👉🏻 Try it now: 🔗 GitHub: 🤗 Hugging Face:

Tencent Hy

412,658 views • 10 months ago

We’re excited to launch SynthID today with GoogleCloud: a digital tool to watermark and identify AI-generated images. 🖼️ It will be available on Imagen, one of Google’s latest text-to-image models. Here’s how it works. 🧵 #GoogleCloudNext

Google DeepMind

333,881 views • 2 years ago

AI is ready to make full films. Seedance 2.0 can now read your entire shot list to generate a full story and keep characters, props, and set design consistent with one image on BytePlus. Duration and consistency are not a problem anymore. Here's how, with prompts:

el.cine

411,388 views • 2 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models
Paper page:
GitHub:
Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

In Prompt Engineering for Vision Models, taught by Abby Jacques Verre and Caleb Kaiser of Comet, you’ll learn how to prompt and fine-tune vision models for personalized image generation, image editing, object detection and segmentation. The prompts you'll use for vision models could be text, point coordinates, or bounding boxes, depending on the model. You'll also learn to tune hyperparameters to shape the output. Models you'll use include Segment-Anything Model (SAM), OWL-ViT, and Stable Diffusion. You'll also learn to fine-tune Stable Diffusion to generate personalized images (say, an image of a specific person), using a handful of images for training. As an example of a multi-step workflow, you'll use OWL-ViT to detect an object based on a text prompt, then pass the bounding box to SAM to create a segmentation mask, and input that mask into Stable Diffusion to replace the original object with a new one based on a text prompt. Controlling vision models can be tricky; this course will teach prompting and fine-tuning techniques to get precise control over their output. Get started here:

Andrew Ng

151,198 views • 2 years ago

How to Train Your Mochi: Introducing LoRA fine-tuning. Customize Mochi on a single GPU with just a few videos. Create any effect or create consistent characters. Make Mochi 1 truly yours.

Genmo

113,880 views • 1 year ago

Last week we released Meta Chameleon: a new mixed-modal research model from Meta FAIR. Get the models ➡️ The 7B & 34B safety tuned models we’ve released can take any combination of text and images as input and produce text outputs using a new early fusion approach. While some LLMs have separate image and text encoders or decoders, Chameleon is one of the first publicly released approaches using a single unified architecture. We’re releasing Chameleon models under a research license to help democratize access to foundational mixed-modal models & further research on early fusion. Approach & training details in the paper ➡️

AI at Meta

54,428 views • 2 years ago

An exciting new course: Fine-tuning and Reinforcement Learning for LLMs: Intro to Post-training, taught by Sharon Zhou, VP of AI at AMD. Available now at
Post-training is the key technique used by frontier labs to turn a base LLM (a model trained on massive unlabeled text to predict the next word/token) into a helpful, reliable assistant that can follow instructions. I've also seen many applications where post-training is what turns a demo application that works only 80% of the time into a reliable system that consistently performs. This course will teach you the most important post-training techniques!
In this 5-module course, Sharon walks you through the complete post-training pipeline: supervised fine-tuning, reward modeling, RLHF, and techniques like PPO and GRPO. You'll also learn to use LoRA for efficient training, and to design evals that catch problems before and after deployment.
Skills you'll gain:
- Apply supervised fine-tuning and reinforcement learning (RLHF, PPO, GRPO) to align models to desired behaviors
- Use LoRA for efficient fine-tuning without retraining entire models
- Prepare datasets and generate synthetic data for post-training
- Understand how to operate LLM production pipelines, with go/no-go decision points and feedback loops
These advanced methods aren’t limited to frontier AI labs anymore, and you can now use them in your own applications. Learn here:

Andrew Ng

132,304 views • 9 months ago

StyleDrop: Text-to-Image Generation in Any Style
Introduces StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion.
Paper page:

AK

56,377 views • 3 years ago

Seedream 5.0 Lite just landed on HailuoAI. 🙌 ➡️ Precise edits, fully controllable 🔗 Lock consistency with up to 14 refs 🧠 Reason across text + image like it has a brain 🎨 Turn messy knowledge into pro visuals ✨ Batch generate without losing your taste Hailuo Members get 365 days unlimited. Go Take it further! 🚀 #Hailuo #seedream5

Hailuo AI (MiniMax)

291,057 views • 5 months ago