[CLIP] by Hand ✍️

The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It underlies many of the multi-modal foundation models we see today.

How does CLIP work?

Goal: 🟨 Learn a shared embedding space for text and image

[1] Given
↳ A mini batch of 3 text-image pairs
↳ OpenAI used 400 million text-image pairs to train its original CLIP model.

Process 1st pair: "big table"

[2] 🟪 Text → 2 Vectors (3D)
↳ Look up word embedding vectors using word2vec.

[3] 🟩 Image → 2 Vectors (4D)
↳ Divide the image into two patches.
↳ Flatten each patch.

[4] Process other pairs
↳ Repeat [2]-[3]

[5] 🟪 Text Encoder & 🟩 Image Encoder
↳ Encode input vectors into feature vectors.
↳ Here, both encoders are simple one-layer perceptrons (linear + ReLU).
↳ In practice, the encoders are usually transformer models.

[6] 🟪 🟩 Mean Pooling: 2 → 1 vector
↳ Average the 2 feature vectors into a single vector by averaging across the columns.
↳ The goal is to have one vector to represent each image or text.

[7] 🟪 🟩 → 🟨 Projection
↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D).
↳ Use a linear layer to project image and text vectors to a 2D shared embedding space.

🏋️ Contrastive Pre-training 🏋️

[8] Prepare for MatMul
↳ Copy text vectors (T1,T2,T3)
↳ Copy the transpose of image vectors (I1,I2,I3)
↳ They are all in the 2D shared embedding space.

[9] 🟦 MatMul
↳ Multiply T and I matrices.
↳ This is equivalent to taking the dot product between every pair of image and text vectors.
↳ The purpose is to use the dot product to estimate the similarity between an image-text pair.

[10] 🟦 Softmax: e^x
↳ Raise e to the power of the number in each cell.
↳ To simplify hand calculation, we approximate e^□ with 3^□.

[11] 🟦 Softmax: ∑
↳ Sum each row for 🟩 image→🟪 text
↳ Sum each column for 🟪 text→🟩 image

[12] 🟦 Softmax: 1 / sum
↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image
↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text

[13] 🟥 Loss Gradients
↳ The "Targets" for the similarity matrices are Identity Matrices.
↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise.
↳ Apply the simple equation [Similarity - Target] to compute the gradients for both directions.
↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the gradient of the loss with respect to the softmax inputs reduces to exactly this difference.
↳ These gradients kick off the backpropagation process to update the weights and biases of the encoders and projection layers (red borders).
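
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the code behind the video: the inputs and all weights are random placeholders, every variable name is hypothetical, and it uses the true e^x rather than the 3^x shortcut. It also leaves out the L2 normalisation and learned temperature that the published CLIP pseudocode applies before the softmax, since the hand exercise omits them too.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, W, b):
    """One-layer perceptron: linear map followed by ReLU (step [5])."""
    return np.maximum(x @ W + b, 0.0)

def softmax(z, axis):
    """Numerically stable softmax along the given axis (steps [10]-[12])."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n_pairs = 3
# [1]-[4] Each text is 2 word vectors (3D); each image is 2 flattened patches (4D).
# Random stand-ins for the word2vec lookups and pixel patches.
texts = rng.normal(size=(n_pairs, 2, 3))
images = rng.normal(size=(n_pairs, 2, 4))

# Hypothetical, randomly initialised parameters.
W_txt, b_txt = rng.normal(size=(3, 3)), np.zeros(3)
W_img, b_img = rng.normal(size=(4, 4)), np.zeros(4)
P_txt = rng.normal(size=(3, 2))  # text projection: 3D -> 2D shared space
P_img = rng.normal(size=(4, 2))  # image projection: 4D -> 2D shared space

# [5] encode, [6] mean-pool 2 vectors -> 1, [7] project to the shared space.
T = encode(texts, W_txt, b_txt).mean(axis=1) @ P_txt    # shape (3, 2)
I = encode(images, W_img, b_img).mean(axis=1) @ P_img   # shape (3, 2)

# [8]-[9] logits[i, j] is the dot product of text i and image j.
logits = T @ I.T                                        # shape (3, 3)

# [10]-[12] Normalise once along rows and once along columns,
# giving one similarity matrix per direction.
sim_by_row = softmax(logits, axis=1)
sim_by_col = softmax(logits, axis=0)

# [13] Targets are identity matrices; gradient w.r.t. the logits is
# simply Similarity - Target for each direction.
target = np.eye(n_pairs)
grad_by_row = sim_by_row - target
grad_by_col = sim_by_col - target
```

Each row of `sim_by_row` and each column of `sim_by_col` is a probability distribution, so the corresponding gradients sum to zero along the same axis.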

Tom Yeh

56,137 subscribers

67,790 views • 2 years ago • via X (Twitter)

Science and Technology • Education

Comments: 9

Vincent Valentine (CEO of UnOpen.ai) • 2 years ago

Could language-vision models spark new intuitive human-computer interfaces? What opportunities await?

Ozyphus • 2 years ago

These are beautiful how do you generate these?

Tom Yeh • 2 years ago

I drew each frame using powerpoint and a drawing tablet. This video has 102 frames. ✍️🏋️

Nikolay Karelin • 2 years ago

Beautiful! Have you seen posts/articles on CLIP fine-tuning?

Tritonix • 2 years ago

CLIP bridges the gap between words & pictures. How could this shape the way we interact with machines in the future? #CLIP #AI #FutureTech

Master Of Code Gl. • 2 years ago

Impressive breakdown of the CLIP model. Your work at the intersection of computer vision and NLP is commendable. At Master of Code, we excel in applying advanced AI like CLIP to revolutionize user interactions.

Maxin 🇬🇭 • 2 years ago

@threadreaderapp unroll

Thread Reader App • 2 years ago

@ProfTomYeh @Maxin_check Hi! the unroll you asked for: Enjoy :) 🤖

@investornelson❤️XALLY🐬$BLUAI $BPAD Runes🔶⛺ • 2 years ago

@IntentAGI $OO @moonberg_ai $ZENT @ZentryHQ $MINE @MineProBusiness

Related videos

[Self-Attention] by Hand ✍️

Self-attention is what enables LLMs to understand context. How does it work? This exercise demonstrates how to calculate a 6-3 attention head by hand. Note that if we have two instances of this, we get 6-6 attention (i.e., multi-head attention, n=2).

-- 𝗚𝗼𝗮𝗹 --
Transform [6D Features 🟧] to [3D Attention Weighted Features 🟦]

-- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 --

[1] Given
↳ A set of 4 feature vectors (6-D): x1,x2,x3,x4

[2] Query, Key, Value
↳ Multiply the features x with the linear transformation matrices WQ, WK, and WV to obtain query vectors (q1,q2,q3,q4), key vectors (k1,k2,k3,k4), and value vectors (v1,v2,v3,v4).
↳ "Self" refers to the fact that both queries and keys are derived from the same set of features.

[3] 🟪 Prepare for MatMul
↳ Copy query vectors
↳ Copy the transpose of key vectors

[4] 🟪 MatMul
↳ Multiply K^T and Q
↳ This is equivalent to taking the dot product between every pair of query and key vectors.
↳ The purpose is to use the dot product as an estimate of the "matching score" between every query-key pair.
↳ This estimate makes sense because the dot product is the numerator of the Cosine Similarity between two vectors.

[5] 🟨 Scale
↳ Divide each element by the square root of dk, which is the dimension of the key vectors (dk=3).
↳ The purpose is to normalize the impact of dk on the matching scores, even if we scale dk to 32, 64, or 128.
↳ To simplify hand calculation, we approximate [ □/sqrt(3) ] with [ floor(□/2) ].

[6] 🟩 Softmax: e^x
↳ Raise e to the power of the number in each cell.
↳ To simplify hand calculation, we approximate e^□ with 3^□.

[7] 🟩 Softmax: ∑
↳ Sum across each column

[8] 🟩 Softmax: 1 / sum
↳ For each column, divide each element by the column sum.
↳ The purpose is to normalize each column so that the numbers sum to 1. In other words, each column is a probability distribution of attention, and we have four of them.
↳ The result is the Attention Weight Matrix (A) (yellow).

[9] 🟦 MatMul
↳ Multiply the value vectors (Vs) with the Attention Weight Matrix (A).
↳ The results are the attention weighted features Zs.
↳ They are fed to the position-wise feed forward network in the next layer.
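
The walkthrough above maps onto a short NumPy sketch. It follows the post's column-vector layout (each feature is a column, and the softmax runs down each column), but the feature values and weight matrices are random placeholders and all names are hypothetical. It uses the exact 1/sqrt(dk) and e^x rather than the hand-calculation approximations.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_k, n = 6, 3, 4              # 6D features, 3D head, 4 feature vectors

# [1] Features x1..x4 stored as the columns of X.
X = rng.normal(size=(d_in, n))

# [2] Hypothetical projection matrices WQ, WK, WV (random placeholders).
W_Q = rng.normal(size=(d_k, d_in))
W_K = rng.normal(size=(d_k, d_in))
W_V = rng.normal(size=(d_k, d_in))
Q, K, V = W_Q @ X, W_K @ X, W_V @ X  # each has shape (3, 4)

# [3]-[4] scores[i, j] is the dot product of key i and query j.
scores = K.T @ Q                     # shape (4, 4)

# [5] Divide by sqrt(dk).
scores = scores / np.sqrt(d_k)

# [6]-[8] Column-wise softmax; subtracting the column max only improves
# numerical stability and does not change the result.
scores = scores - scores.max(axis=0, keepdims=True)
A = np.exp(scores)
A = A / A.sum(axis=0, keepdims=True)  # each column sums to 1

# [9] Attention-weighted features: column j of Z mixes the value vectors
# using the weights in column j of A.
Z = V @ A                            # shape (3, 4)
```

Many library implementations store vectors as rows instead, in which case the same computation reads `softmax(Q @ K.T / sqrt(dk), axis=-1) @ V`.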

Tom Yeh

101,010 views • 2 years ago

Text is often the hardest part of image generation to get right. MAI-Image-2 improves consistency and legibility for in-image text across infographics, diagrams, and slides — reducing the gap between prompt and output. Try it for yourself.

Microsoft AI

30,778 views • 2 months ago

Designing an Encoder for Fast Personalization of Text-to-Image Models TL;DR: use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps abs: project page:

AK

165,158 views • 3 years ago

CosmicMan A Text-to-Image Foundation Model for Humans We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago

Introducing StyleDrop, a model that allows a significantly higher level of stylized text-to-image synthesis by using a few style reference images that describe the style for text-to-image generation, bypassing the burden of text prompt engineering. More→

Google AI

80,357 views • 2 years ago

TurboEdit Instant text-based image editing discuss: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. It can further control the editing strength and accept instructive text prompt. Our approach facilitates realistic text-guided image edits in real-time, requiring only 8 number of functional evaluations (NFEs) in inversion (one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

AK

16,062 views • 1 year ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

The winner of Lovable's weekend competition: Kolbo ai - A powerful tool to help make all sorts of social media content with AI Features of the winning app: - Supabase for backend - Project-based organization system - OpenAI for text & image generation - Anthropic for text generation - Google Gemini for text generation - Midjourney for image generation - for image generation - Text-to-speech - Speech-to-text - Stripe for payments - mu for music generation Built by Zohar Vanunu 👇

Lovable

35,841 views • 1 year ago

Phoenix GenAI’s new generative AI model Image-to-3D is now online. Create high fidelity 3D models importable to Unreal Engine 5 or Unity simply by sending an original image into PhoenixLLM. Better yet, generate the source image using Phoenix GenAI’s Flux and feed it into Image-to-3D. This release marks yet another upgrade of GenAI’s arsenal of capabilities, getting it ready for multi-workflow GenAI agents, in which users will be able to combine text-to-image, image-to-prompt, text-to-video, text-to-3D, and image-to-3D into complex multi-step workflows with simple commands via PhoenixLLM. Image-to-3D is yet another addition to Phoenix’s Vertical AI Solutions for gaming, content, and metaverse. Users are able to use it as a Phoenix-native alternative to SkyNet AI Marketplace’s Tripo Integration earlier this year. #Phoenix $PHB

Phoenix AI

30,853 views • 1 year ago

SORA by Hand ✍️

OpenAI’s #SORA took over the Internet when it was announced earlier this year. The technology behind Sora is the Diffusion Transformer (DiT) developed by William Peebles and Saining Xie. How does DiT work?

𝗚𝗼𝗮𝗹: Generate a video conditioned by a text prompt and a series of diffusion steps

[1] Given
↳ Video
↳ Prompt: "sora is sky"
↳ Diffusion step: t = 3

[2] Video → Patches
↳ Divide all pixels in all frames into 4 spacetime patches

[3] Visual Encoder: Pixels 🟨 → Latent 🟩
↳ Multiply the patches with weights and biases, followed by ReLU.
↳ The result is a latent feature vector per patch.
↳ The purpose is dimension reduction from 4 (2x2x1) to 2 (2x1).
↳ In the paper, the reduction is 196,608 (256x256x3) → 4096 (32x32x4)

[4] ⬛ Add Noise
↳ Sample a noise according to the diffusion time step t. Typically, the larger the t, the larger the noise.
↳ Add the Sampled Noise to the latent features to obtain the Noised Latent.
↳ The goal is to purposely add noise to a video and ask the model to guess what that noise is.
↳ This is analogous to training a language model by purposely deleting a word in a sentence and asking the model to guess what the deleted word was.

[5-7] 🟪 Conditioning by Adaptive Layer Norm

[5] Encode Conditions
↳ Encode "sora is sky" into a text embedding vector [0,1,-1].
↳ Encode t = 3 as a binary vector [1,1].
↳ Concatenate the two vectors into a 5D column vector.

[6] Estimate Scale/Shift
↳ Multiply the combined vector with weights and biases.
↳ The goal is to estimate the scale [2,-1] and shift [-1,5].
↳ Copy the result to (X) and (+)

[7] Apply Scale/Shift
↳ Scale the noised latent by [2,-1]
↳ Shift the scaled noised latent by [-1,5]
↳ The result is the "conditioned" noised latent.

[8-10] Transformer

[8] Self-Attention
↳ Feed the conditioned noised latent to the Query-Key function to obtain a self-attention matrix.
↳ Value is omitted for simplicity.

[9] Attention Pooling
↳ Multiply the conditioned noised latent with the self-attention matrix.
↳ The results are the attention weighted features.

[10] Pointwise Feed Forward Network
↳ Multiply the attention weighted features with weights and biases.
↳ The result is the Predicted Noise.

🏋️‍♂️ 𝗧𝗿𝗮𝗶𝗻

[11]
↳ Calculate the MSE loss gradients by taking the difference between the Predicted Noise and the Sampled Noise (ground truth).
↳ Use the loss gradients to kick off backpropagation to update all learnable parameters (red borders).
↳ Note that the visual encoder's and decoder's parameters are frozen (blue borders).

🎨 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 (𝗦𝗮𝗺𝗽𝗹𝗲)

[12] Denoise
↳ Subtract the predicted noise from the noised latent to obtain the noise-free latent.

[13] Visual Decoder: Latent 🟩 → Pixels 🟨
↳ Multiply the patches with weights and biases, followed by ReLU.

[14] Patches → Video
↳ Rearrange the patches into a sequence of video frames.
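
Steps [3]-[12] above can be sketched roughly in NumPy as follows. This is a toy illustration under several assumptions, not the DiT or Sora implementation: the latent, all weights, and the noise scale are random placeholders with hypothetical names, the noise schedule is left out entirely, and applying a column-wise softmax to the Query-Key scores is my assumption, since the post does not say how the self-attention matrix is normalised. Only the text embedding [0,1,-1] and the binary timestep vector [1,1] are taken from the post.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches, d_latent = 4, 2

def softmax_cols(z):
    """Column-wise softmax (assumed normalisation for the attention matrix)."""
    z = z - z.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

# [3] Stand-in for the frozen visual encoder's output:
# one 2D latent vector per patch, stored as columns.
latent = rng.normal(size=(d_latent, n_patches))

# [4] Add sampled noise (a real schedule would set its scale from t).
sampled_noise = rng.normal(size=latent.shape)
noised = latent + sampled_noise

# [5] Encode the conditions and concatenate into a 5D vector.
text_emb = np.array([0.0, 1.0, -1.0])   # "sora is sky", from the post
t_bits = np.array([1.0, 1.0])           # t = 3 in binary, from the post
cond = np.concatenate([text_emb, t_bits])

# [6] A hypothetical linear layer maps the conditions to scale and shift.
W_c = rng.normal(size=(2 * d_latent, cond.size))
b_c = np.zeros(2 * d_latent)
scale, shift = np.split(W_c @ cond + b_c, 2)

# [7] Apply scale and shift to every patch (adaptive layer norm style).
h = scale[:, None] * noised + shift[:, None]

# [8] Query-Key self-attention matrix; values are omitted as in the post.
W_Q = rng.normal(size=(d_latent, d_latent))
W_K = rng.normal(size=(d_latent, d_latent))
A = softmax_cols((W_K @ h).T @ (W_Q @ h))    # shape (4, 4)

# [9] Attention pooling, then [10] a pointwise feed-forward layer.
pooled = h @ A
W_f = rng.normal(size=(d_latent, d_latent))
b_f = np.zeros((d_latent, 1))
predicted_noise = W_f @ pooled + b_f

# [11] MSE loss gradient w.r.t. the prediction, up to a constant factor.
grad = predicted_noise - sampled_noise

# [12] Denoise by subtracting the predicted noise.
denoised = noised - predicted_noise
```

With untrained random weights the prediction is meaningless, so `denoised` will not resemble `latent`; the sketch only shows how the tensors flow between steps.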

Tom Yeh

238,097 views • 2 years ago

Higgsfield Mod for Minecraft is live. > prompt any building or city, even the Statue of Liberty > create paintings with text-to-image > snap a view and restyle it with image-to-image > make videos from a prompt with text-to-video. > animate in-game photos with image-to-video

Higgsfield AI 🧩

453,351 views • 24 days ago

Most don't know (1) how easy it is to invert embedding vectors back into sentences, (2) this is a perfect task for text diffusion models. Here's a 78M parameter model and live demo that recovers 80% of tokens from Qwen3-Embedding and EmbeddingGemma vectors. Works even on multilingual input.

Jina AI

12,813 views • 4 months ago

"This is how GPT-4 sees and hears itself" I used GPT-4 to describe itself. Then I used its description to generate an image, a video based on this image and a soundtrack. Tools I used: GPT-4, Midjourney, Kainber AI, Mubert, RunwayML This is the description I used that GPT-4 had of itself as a prompt to text-to-image, image-to-video, and text-to-music. I put the video and sound together in RunwayML.

Kris Kashtanova

1,233,420 views • 3 years ago

We’re excited to announce the release and open-source of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,572 views • 9 months ago

Run InstantStyle Locally with 1 Click InstantStyle lets you generate images with a style of ANY other image, instantly. No LoRA required. Both text-to-image/image-to-image. I wrote a 1 click launcher for the gradio app from Frank (Haofan) Wang (The author of InstantStyle/InstantId!).

cocktail peanut

39,104 views • 2 years ago

Can a small academic team build a strong text-to-image model using only public datasets? Introducing i1: a simple, fully open recipe for strong text-to-image models

Zhuang Liu

65,784 views • 7 days ago

Struggling with slow inference of diffusion and flow models? Check out the video below—I’ve been using our new FastGen library to achieve 7-28x acceleration for text-2-image and {text,image,video}-2-video generation without sacrificing visual fidelity!

Julius Berner

13,623 views • 4 months ago

Make a few more of these ... Make a LOT more of these ... Gemini 2.0 native image output is enabling a new way to prompt: instructing with image and text together. Subtle shifts in how I draw change how Gemini interprets the same text prompt.

Alexander Chen

23,601 views • 1 year ago

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision: - a high quality, unified embedding space for image, text, and multimodal tasks - outperforms both OpenAI CLIP and text-embedding-3-small - open weights and code to enable indie hacking, research, and experimentation - released in collaboration with MongoDB, LlamaIndex 🦙, , Hugging Face, Amazon Web Services, DigitalOcean, Lambda

CalCo

103,205 views • 2 years ago

Introducing SDXL Turbo: A real-time text-to-image generation model. SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. The code, research paper, and weights for non-commercial use are now available on our website. You can test SDXL Turbo on Stability AI’s image editing platform Clipdrop, with a beta demonstration of the real-time text-to-image generation capabilities. Learn more:

Stability AI

976,312 views • 2 years ago