Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.
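As a rough illustration of the dot-product idea in the post, here is a minimal NumPy sketch. It is not the Flux 2 API or the author's implementation: the function name `saliency_maps`, the array shapes, and the random arrays standing in for DiT text and image representations are all assumptions made for the example, and the min-max rescaling is only a display choice.

```python
import numpy as np

def saliency_maps(image_tokens, text_tokens, grid_hw):
    """Hypothetical helper: one saliency map per text token.

    image_tokens: (num_patches, dim) image patch representations.
    text_tokens:  (num_concepts, dim) text token representations.
    grid_hw:      (height, width) of the patch grid, height * width == num_patches.
    """
    h, w = grid_hw
    if image_tokens.shape[0] != h * w:
        raise ValueError("number of image tokens must equal height * width")
    # Dot product between every text token and every image patch.
    scores = text_tokens @ image_tokens.T  # (num_concepts, num_patches)
    # Rescale each map to [0, 1] for display (an assumption, not part of the post).
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    scores = (scores - lo) / np.maximum(hi - lo, 1e-8)
    return scores.reshape(-1, h, w)

# Random stand-ins for real model activations (assumed shapes).
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(16 * 16, 64))  # 16x16 patch grid, 64-dim
text_tokens = rng.normal(size=(3, 64))         # three text concepts
maps = saliency_maps(image_tokens, text_tokens, (16, 16))
print(maps.shape)
```

In practice the two inputs would come from the same layer of the model so that text and image tokens live in a shared space; which layer and which representation to use is left open here.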

Alec Helbling

10,800 subscribers

94,095 views • 7 months ago • via X (Twitter)

Related videos

Text-to-image generators are incredible! But why must we use *plain text* as the interface? 🤔 Check out Rich Text-to-Image generation! Rich Text offers an intuitive and expressive way to explore your visual imagination. Some of my favorite examples. 👇

Jia-Bin Huang

73,382 views • 2 years ago

Switti -- a new scale-wise transformer for text-to-image generation 🦾 🔥 Improved generation of fine-grained details. Outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.

Gradio

29,314 views • 1 year ago

Diffusion Transformers aren't just generative models, but also powerful multi-modal encoders. ConceptAttention creates rich heatmaps of text concepts in images from DiT representations. This even works on real images, and can be applied to tasks like segmentation! Demo 👇

Alec Helbling

24,419 views • 1 year ago

Snap presents MoA Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK

47,488 views • 2 years ago

FLUX.2 is live on Runware! D0 drop! 🔥 Built by Black Forest Labs, FLUX.2 is the new SOTA model that brings better control over structure, text, and references in image generation. This is a huge step forward for image gen & image editing. And we’re here to offer you the best prices on D0. This model comes in three versions. Details below.

Runware

293,432 views • 8 months ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:
1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.
Paper: Project:

Lior Alexander

158,565 views • 3 years ago

You can build your own AI-apps powered by OpenAI and Stable Diffusion. **Even if you don't know how to code**
Using no-code, build your own text and image generation apps with:
- Bubble
- OpenAI
- Stable Diffusion
Learn how:

Seth Kramer

6,893,724 views • 3 years ago

RIP Photoshop. Higgsfield just launched Canvas, a state-of-the-art image editing model. With just two clicks, you can swap any object in an image. Logos, text, texture and scale stay exactly as they are... Here's how it works:

Angry Tom

156,230 views • 1 year ago

Adobe just launched the beta of Firefly, their AI image creator. And it's incredible!
- text-to-image
- AI 3D modelling
- AI Video editing
This is the first AI image generator built towards solving creator problems. Here's a breakdown.

Sudharshan

1,224,171 views • 3 years ago

Text to Image: ChatGPT Image to Video: Kling 2.0 Prompt: Shot from a person's point of view (POV), the person starts using computer Kling AI

Ozan Sihay

40,919 views • 1 year ago

ANNOUNCEMENT Now anyone can use Stable Diffusion XL, our latest text-to-image generative AI model, on our Clipdrop platform, for free! Test the limits of your creativity. #SDXL Try it here →

Stability AI

348,371 views • 3 years ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

🤯 𝐒𝐥𝐢𝐜𝐞𝐝𝐢𝐭 : A revolutionary approach to zero-shot video editing✏️🎥 using text-to-image diffusion models 🎞️Slicedit achieves great temporal consistency by leveraging spatio-temporal "slices"🔀

Gradio

19,586 views • 2 years ago

New ML gem on the hub: LDM3D by Intel. This diffusion model generates image & depth from text prompts. Using a custom @gradio 6dof three.js component you can generate immersive 360-degree views from prompts demo: model:

Radamés Ajna

56,696 views • 2 years ago

SOMEONE GOT TIRED OF PAYING HIGGSFIELD AI'S SUBSCRIPTION SO HE REBUILT THE WHOLE THING AND OPEN-SOURCED IT 200+ models. text-to-image, image-to-image, text-to-video, image-to-video all in one interface you configure a virtual camera in the Cinema Studio. pick the body, the lens, the focal length, the aperture and it writes the optimized cinematic prompt for you. completely in the background you never touch the camera keywords. you just set up the shot like a real cinematographer would Kling v3, Sora 2, Veo 3, Flux Dev, Midjourney v7, GPT-4o, Seedream 5.0, Runway Gen-3 all in there self-hosted. MIT licensed. runs on your machine. your data stays local the only thing you pay for is the model API calls themselves someone built this so you never have to pay Higgsfield AI again

Rimsha Bhardwaj

101,058 views • 3 months ago

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

AK

124,048 views • 1 year ago

Yesterday someone suggested I create a fight between Alucard and Drolta Tzuentes using Kling AI 2.6. I usually work with Text to Video, but in this case I had to use Image to Video. The base image is a fusion of two images created in Niji and combined with GPT Image 1.5. Then I added the right prompt to give the fight speed and dynamism. The result isn’t perfect, but it’s pretty cool.

OscarAI

18,533 views • 7 months ago

Flux Kontext is easily one of the best image editing models out there, but it’s capable of so much more. A quick test: creating a product shot using only Flux Kontext. This was done with just one reference image and a few prompts to generate different angles.

Halim Alrasihi

58,759 views • 1 year ago

Adobe just released Transparent Background videos for Firefly Video model 🎉 Formerly known as TransPixar (later named Transpixeler) it's here for you to play on Firefly web. It supports both text-to-image and image-to-video.

Kris Kashtanova

33,460 views • 10 months ago

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,498 views • 2 years ago