Designing an Encoder for Fast Personalization of Text-to-Image Models TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps abs: project page:

AK

503,914 subscribers

165,158 views • 3 years ago • via X (Twitter)

9 comments

salt • 3 years ago

curious as to how well this will work. from looking at their explanation, I feel like there is a possibility that it might not work as well as the examples in practice

Alex Volkov (Thursd/AI) • 3 years ago

Anonymous authors with no code? Hm

Sudharshan • 3 years ago

Wow this is big if it works as well as the examples!

Nilu Kulasingham • 3 years ago

space is moving so goddamn fast to keep up lol

Mg. Ing. Ernesto C. R. DataۗScientist GWUniversity • 3 years ago

The paper proposes an encoder-based domain-tuning approach for fast personalization of text-to-image models. Reference: R. Gal, M. Arar, Y. Atzmon, "Designing an Encoder for Fast Personalization of Text-to-Image Models"

Mg. Ing. Ernesto C. R. DataۗScientist GWUniversity • 3 years ago

Summarization, translation, keyword highlighting, and reference formatting were all done by #ChatGPT. R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik and D. Cohen-Or, "Designing an Encoder for Fast Personalization of Text-to-Image Models", 23 Feb 2023.

Thib∞d • 3 years ago

Sounds very interesting. Not zeroshot yet. I hope to see img prompt/mixing like MJ for SD soon!

Bobcat • 3 years ago

👀

Jaisurya • 3 years ago

Each 10 steps looks adorable 😍

Related videos

Expressive Text-to-Image Generation with Rich Text abs: project page:

AK

209,253 views • 3 years ago

Check out Lumina-Image 2.0 🖼️ an efficient, unified, and transparent image generative model built with Gemma 2's text encoder and FLUX’s VAE 👇

Google AI Developers

33,467 views • 1 year ago

Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280 x 2048 abs: project page:

AK

718,746 views • 3 years ago

Can a small academic team build a strong text-to-image model using only public datasets? Introducing i1: a simple, fully open recipe for strong text-to-image models

Zhuang Liu

60,774 views • 2 days ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas (like posters with slogans or multi-panel comics) into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

CosmicMan A Text-to-Image Foundation Model for Humans We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago

Phoenix GenAI’s new generative AI model Image-to-3D is now online. Create high fidelity 3D models importable to Unreal Engine 5 or Unity simply by sending an original image into PhoenixLLM. Better yet, generate the source image using Phoenix GenAI’s Flux and feed it into Image-to-3D. This release marks yet another upgrade of GenAI’s arsenal of capabilities, getting it ready for multi-workflow GenAI agents, in which users will be able to combine text-to-image, image-to-prompt, text-to-video, text-to-3D, and image-to-3D into complex multi-step workflows with simple commands via PhoenixLLM. Image-to-3D is yet another addition to Phoenix’s Vertical AI Solutions for gaming, content, and metaverse. Users are able to use it as a Phoenix-native alternative to SkyNet AI Marketplace’s Tripo Integration earlier this year. #Phoenix $PHB

Phoenix AI

30,853 views • 1 year ago

🤯 OneDiffusion: A versatile, large-scale diffusion model that seamlessly supports bidirectional image synthesis and understanding across diverse tasks. ✅ Text to Image ✅ Image to Depth ✅ Image to Segmentation ✅ Image to Pose ✅ FaceID ✅ Image to Multiview How to use & more👇

Gradio

11,820 views • 1 year ago

TurboEdit Instant text-based image editing discuss: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompts. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 function evaluations (NFEs) in inversion (one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

AK

16,062 views • 1 year ago

Mini-Omni 2 understands image, audio and text inputs all via end-to-end voice conversations with users 🔥 > Understands and processes images, speech, and text > Generates real-time speech responses > Supports interruptions during speech Technical Overview: > Concatenates image, audio, and text features for input. > Uses text-guided delayed parallel output for real-time speech > Involves encoder adaptation, modal alignment, and multimodal fine-tuning Best part: MIT licensed ⚡

Vaibhav (VB) Srivastav

45,409 views • 1 year ago

Introducing ConceptAttention, an approach to interpreting diffusion transformer models! Write a prompt, choose some concepts, generate an image, and get high-quality heatmaps of text concepts. Our method outperforms existing methods like cross attention. Link to demo 👇

Alec Helbling

36,631 views • 1 year ago

[CLIP] by Hand ✍️
The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work?
Goal: 🟨 Learn a shared embedding space for text and image
[1] Given
↳ A mini batch of 3 text-image pairs
↳ OpenAI used 400 million text-image pairs to train its original CLIP model.
Process 1st pair: "big table"
[2] 🟪 Text → 2 Vectors (3D)
↳ Look up word embedding vectors using word2vec.
[3] 🟩 Image → 2 Vectors (4D)
↳ Divide the image into two patches.
↳ Flatten each patch
[4] Process other pairs
↳ Repeat [2]-[3]
[5] 🟪 Text Encoder & 🟩 Image Encoder
↳ Encode input vectors into feature vectors
↳ Here, both encoders are simple one-layer perceptrons (linear + ReLU)
↳ In practice, the encoders are usually transformer models.
[6] 🟪 🟩 Mean Pooling: 2 → 1 vector
↳ Average 2 feature vectors into a single vector by averaging across the columns
↳ The goal is to have one vector to represent each image or text
[7] 🟪 🟩 -> 🟨 Projection
↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D).
↳ Use a linear layer to project image and text vectors to a 2D shared embedding space.
🏋️ Contrastive Pre-training 🏋️
[8] Prepare for MatMul
↳ Copy text vectors (T1,T2,T3)
↳ Copy the transpose of image vectors (I1,I2,I3)
↳ They are all in the 2D shared embedding space.
[9] 🟦 MatMul
↳ Multiply T and I matrices.
↳ This is equivalent to taking the dot product between every pair of image and text vectors.
↳ The purpose is to use the dot product to estimate the similarity between an image-text pair.
[10] 🟦 Softmax: e^x
↳ Raise e to the power of the number in each cell
↳ To simplify hand calculation, we approximate e^□ with 3^□.
[11] 🟦 Softmax: ∑
↳ Sum each row for 🟩 image→🟪 text
↳ Sum each column for 🟪 text→🟩 image
[12] 🟦 Softmax: 1 / sum
↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image
↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text
[13] 🟥 Loss Gradients
↳ The "Targets" for the similarity matrices are Identity Matrices.
↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise.
↳ Apply the simple equation of [Similarity - Target] to compute gradients for both directions.
↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way.
↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).
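
The steps in the post above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the post author's worksheet or OpenAI's implementation: the input vectors and weights are random stand-ins (no word2vec lookup, no real image patches), the helper names `encode` and `clip_step` are hypothetical, the softmax uses e rather than the 3^x hand-calculation shortcut, and the similarity matrix is laid out with rows indexing images and columns indexing texts so that row sums give image→text and column sums give text→image.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

def rand_matrix(rows, cols):
    # Random stand-in for learned weights (assumption, not trained values).
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(vectors, W_enc, W_proj):
    # [5] linear + ReLU per input vector, [6] mean pooling, [7] projection to 2D.
    feats = [[max(0.0, a) for a in matvec(W_enc, v)] for v in vectors]
    pooled = [sum(col) / len(feats) for col in zip(*feats)]
    return matvec(W_proj, pooled)

def softmax(values):
    m = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(M):
    return [list(col) for col in zip(*M)]

def clip_step(text_embs, image_embs):
    n = len(text_embs)
    # [9] dot product between every image/text pair: rows = images, cols = texts.
    logits = [[sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
              for img in image_embs]
    # [10]-[12] normalize rows (image -> text) and columns (text -> image).
    i2t = [softmax(row) for row in logits]
    t2i = transpose([softmax(col) for col in transpose(logits)])
    # [13] gradient w.r.t. the logits is similarity minus the identity target.
    grad_i2t = [[i2t[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    grad_t2i = [[t2i[i][j] - (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return logits, i2t, t2i, grad_i2t, grad_t2i

# [1]-[4] toy mini batch: 3 pairs, 2 text vectors (3D) and 2 patch vectors (4D) each.
texts = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)] for _ in range(3)]
images = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)] for _ in range(3)]

W_text, P_text = rand_matrix(3, 3), rand_matrix(2, 3)
W_img, P_img = rand_matrix(4, 4), rand_matrix(2, 4)

text_embs = [encode(t, W_text, P_text) for t in texts]
image_embs = [encode(p, W_img, P_img) for p in images]
logits, i2t, t2i, grad_i2t, grad_t2i = clip_step(text_embs, image_embs)
```

Because each softmax row (or column) sums to 1 and each identity row (or column) also sums to 1, every row of `grad_i2t` and every column of `grad_t2i` sums to zero, which is a quick sanity check on the [Similarity - Target] rule.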

Tom Yeh

67,790 views • 2 years ago

Text to image with midjourney and image to video with gen2 by @commonstyle

AK

668,116 views • 2 years ago

[1/5] Always wondered what people see when looking at a Rorschach test? SpaText - our recent #CVPR2023 paper from @MetaAI may give you a sneak peek! TL;DR: We extend text-to-image models with region-specific textual controllability. Project Page:

Omri Avrahami

19,389 views • 3 years ago

DiffSplat Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation DiffSplat is a generative framework to synthesize 3D Gaussian Splats from text prompts & single-view images in ⚡️ 1~2 seconds. It is fine-tuned directly from a pretrained text-to-image diffusion model.

AK

38,416 views • 1 year ago

StyleDrop: Text-to-Image Generation in Any Style introduce StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. paper page:

AK

56,372 views • 3 years ago

We’re excited to announce the release and open-sourcing of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,572 views • 9 months ago

Introducing StyleDrop, a model that allows a significantly higher level of stylized text-to-image synthesis by using a few style reference images that describe the style for text-to-image generation, bypassing the burden of text prompt engineering. More→

Google AI

80,357 views • 2 years ago

Higgsfield Mod for Minecraft is live. > prompt any building or city, even the Statue of Liberty > create paintings with text-to-image > snap a view and restyle it with image-to-image > make videos from a prompt with text-to-video. > animate in-game photos with image-to-video

Higgsfield AI 🧩

452,190 views • 19 days ago

BUILD 🔥: Microsoft is preparing new image and voice models for the announcement on June 2. > MAI Voice 2, a multilingual model supporting 15 new languages and a wider emotional range (check voice samples in the article) > MAI Transcribe 1.5, a new model for speech-to-text use cases. > MAI Image 2.5, already announced last week, is now available on LM Arena in preview. Compared to MAI Image 2, it supports file uploads and can be used for image editing.

🚨 AI News | TestingCatalog

46,260 views • 25 days ago