Text-to-image diffusion transformer models learn to align text and image representations as a byproduct of their conditional denoising task. By taking the dot product between the text and image representations of a DiT model (like Flux 2), you can create rich saliency maps.
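The dot-product idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the author's implementation and not the Flux 2 API: the random arrays stand in for image-patch and text-token representations taken from a DiT layer, `saliency_maps` is a hypothetical helper name, and the softmax over text tokens is an added normalization choice that the post does not specify.

```python
import numpy as np

def saliency_maps(image_tokens, text_tokens, grid_hw):
    """Return one heatmap per text token from dot-product similarity.

    image_tokens: (H*W, d) array, one vector per image patch (assumed input).
    text_tokens:  (K, d) array, one vector per text token (assumed input).
    grid_hw:      (H, W) shape of the patch grid.
    """
    h, w = grid_hw
    if image_tokens.shape[0] != h * w:
        raise ValueError("image_tokens must have H*W rows")
    # Dot product between every patch and every text token: shape (H*W, K).
    scores = image_tokens @ text_tokens.T
    # Softmax over the text axis (an assumption, not stated in the post),
    # so each patch spreads a total weight of 1 across the text tokens.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Rearrange to (K, H, W): one spatial map per text token.
    return weights.T.reshape(-1, h, w)

# Random stand-ins for representations extracted from a real model.
rng = np.random.default_rng(0)
H, W, D, K = 4, 4, 8, 3
image_tokens = rng.normal(size=(H * W, D))
text_tokens = rng.normal(size=(K, D))
maps = saliency_maps(image_tokens, text_tokens, (H, W))
print(maps.shape)  # (3, 4, 4)
```

With real model activations, each `maps[k]` could be upsampled to the image resolution and overlaid on the image as a heatmap for the k-th text token.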

Alec Helbling

10,800 subscribers

94,095 views • 7 months ago • via X (Twitter)

Science & Technology

Related videos

Text-to-image generators are incredible! But why must we use *plain text* as the interface? 🤔 Check out Rich Text-to-Image generation! Rich Text offers an intuitive and expressive way to explore your visual imagination. Some of my favorite examples. 👇

Jia-Bin Huang

73,382 views • 2 years ago

Switti -- a new scale-wise transformer for text-to-image generation 🦾 🔥 Improved generation of fine-grained details. Outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.

Gradio

29,314 views • 1 year ago

Diffusion Transformers aren't just generative models, but also powerful multi-modal encoders. ConceptAttention creates rich heatmaps of text concepts in images from DiT representations. This even works on real images, and can be applied to tasks like segmentation! Demo 👇

Alec Helbling

24,419 views • 1 year ago

Snap presents MoA Mixture-of-Attention for Subject-Context Disentanglement in Personalized Image Generation We introduce a new architecture for personalization of text-to-image diffusion models, coined Mixture-of-Attention (MoA). Inspired by the Mixture-of-Experts

AK

47,488 views • 2 years ago

FLUX.2 is live on Runware! D0 drop! 🔥 Built by Black Forest Labs, FLUX.2 is the new SOTA model that brings better control over structure, text, and references in image generation. This is a huge step forward for image gen & image editing. And we’re here to offer you the best prices on D0. This model comes in three versions. Details below.

Runware

293,432 views • 8 months ago

NVIDIA just released a very impressive text-to-video paper. Video Latent Diffusion Models (Video LDMs) use a diffusion model in a compressed latent space to generate high-resolution videos. Here's a brief overview of how it works:

1. Pre-train image LDM on a dataset of images.
2. Turn the image LDM into a Video LDM by adding temporal layers to model video frames.
3. Fine-tune the Video LDM on encoded video sequences to create a video generator.
4. Temporally align diffusion model upsamplers to generate high-resolution videos.
5. Validate Video LDM on real driving videos of 512x1024 resolution, achieving state-of-the-art performance.
6. Apply the approach in creative content creation with text-to-video modeling.

Paper: Project:

Lior Alexander

158,565 views • 3 years ago

You can build your own AI-apps powered by OpenAI and Stable Diffusion. **Even if you don't know how to code.** Using no-code, build your own text and image generation apps with:

- Bubble
- OpenAI
- Stable Diffusion

Learn how:

Seth Kramer

6,893,724 views • 3 years ago

RIP Photoshop. Higgsfield just launched Canvas, a state-of-the-art image editing model. With just two clicks, you can swap any object in an image. Logos, text, texture and scale stay exactly as they are... Here's how it works:

Angry Tom

156,230 views • 1 year ago

Adobe just launched the beta of Firefly, their AI image creator. And it's incredible!

- text-to-image
- AI 3D modelling
- AI Video editing

This is the first AI image generator built towards solving creator problems. Here's a breakdown.

Sudharshan

1,224,171 views • 3 years ago

Text to Image: ChatGPT Image to Video: Kling 2.0 Prompt: Shot from a person's point of view (POV), the person starts using computer Kling AI

Ozan Sihay

40,919 views • 1 year ago

ANNOUNCEMENT Now anyone can use Stable Diffusion XL, our latest text-to-image generative AI model, on our Clipdrop platform, for free! Test the limits of your creativity. #SDXL Try it here →

Stability AI

348,371 views • 3 years ago

Create a 3D model from a single image, set of images or a text prompt in < 1 minute 😮‍💨 This new AI paper called CAT3D shows us that it’ll keep getting easier to produce 3D models from 2D images — whether it’s a sparser real world 3D scan (a few photos instead of hundreds) or your favorite 2D image generator like Midjourney (just an image). How does this magic work? “This architecture is similar to video diffusion models, but with camera pose embeddings for each image instead of time embeddings. The generated views are passed into a robust 3D reconstruction pipeline to create the 3D representation (Zip-NeRF or 3DGS)”

Bilawal Sidhu

92,792 views • 2 years ago

🤯 𝐒𝐥𝐢𝐜𝐞𝐝𝐢𝐭 : A revolutionary approach to zero-shot video editing✏️🎥 using text-to-image diffusion models 🎞️Slicedit achieves great temporal consistency by leveraging spatio-temporal "slices"🔀

Gradio

19,586 views • 2 years ago

New ML gem on the hub: LDM3D by Intel. This diffusion model generates image & depth from text prompts. Using a custom @gradio 6dof three.js component you can generate immersive 360-degree views from prompts demo: model:

Radamés Ajna

56,696 views • 2 years ago

SOMEONE GOT TIRED OF PAYING HIGGSFIELD AI'S SUBSCRIPTION SO HE REBUILT THE WHOLE THING AND OPEN-SOURCED IT 200+ models. text-to-image, image-to-image, text-to-video, image-to-video all in one interface you configure a virtual camera in the Cinema Studio. pick the body, the lens, the focal length, the aperture and it writes the optimized cinematic prompt for you. completely in the background you never touch the camera keywords. you just set up the shot like a real cinematographer would Kling v3, Sora 2, Veo 3, Flux Dev, Midjourney v7, GPT-4o, Seedream 5.0, Runway Gen-3 all in there self-hosted. MIT licensed. runs on your machine. your data stays local the only thing you pay for is the model API calls themselves someone built this so you never have to pay Higgsfield AI again

Rimsha Bhardwaj

101,058 views • 3 months ago

Show-o One Single Transformer to Unify Multimodal Understanding and Generation discuss: We present a unified transformer, i.e., Show-o, that unifies multimodal understanding and generation. Unlike fully autoregressive models, Show-o unifies autoregressive and (discrete) diffusion modeling to adaptively handle inputs and outputs of various and mixed modalities. The unified model flexibly supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation. Across various benchmarks, it demonstrates comparable or superior performance to existing individual models with an equivalent or larger number of parameters tailored for understanding or generation. This significantly highlights its potential as a next-generation foundation model.

AK

124,048 views • 1 year ago

Yesterday someone suggested I create a fight between Alucard and Drolta Tzuentes using Kling AI 2.6. I usually work with Text to Video, but in this case I had to use Image to Video. The base image is a fusion of two images created in Niji and combined with GPT Image 1.5. Then I added the right prompt to give the fight speed and dynamism. The result isn’t perfect, but it’s pretty cool.

OscarAI

18,533 views • 7 months ago

Flux Kontext is easily one of the best image editing models out there, but it’s capable of so much more. A quick test: creating a product shot using only Flux Kontext. This was done with just one reference image and a few prompts to generate different angles.

Halim Alrasihi

58,759 views • 1 year ago

Adobe just released Transparent Background videos for Firefly Video model 🎉 Formerly known as TransPixar (later named Transpixeler) it's here for you to play on Firefly web. It supports both text-to-image and image-to-video.

Kris Kashtanova

33,460 views • 10 months ago

Today, we are releasing Stable Video Diffusion, our first foundation model for generative AI video based on the image model, Stable Diffusion. As part of this research preview, the code, weights, and research paper are now available. Additionally, today you can sign up for our waitlist to access a new upcoming web experience featuring a Text-To-Video interface. To access the model & sign up for our waitlist, visit our website here:

Stability AI

1,024,498 views • 2 years ago