Загрузка видео...

Не удалось загрузить видео

На главную

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space...

67,790 просмотров • 2 лет назад •via X (Twitter)

Комментарии: 9

Фото профиля Vincent Valentine (CEO of UnOpen.ai)
Vincent Valentine (CEO of UnOpen.ai)2 лет назад

Could language-vision models spark new intuitive human-computer interfaces? What opportunities await?

Фото профиля Ozyphus
Ozyphus2 лет назад

These are beautiful how do you generate these?

Фото профиля Tom Yeh
Tom Yeh2 лет назад

I drew each frame using powerpoint and a drawing tablet. This video has 102 frames. ✍️🏋️

Фото профиля Nikolay Karelin
Nikolay Karelin2 лет назад

Beautiful! Have you seen posts/articles on CLIP fine-tuning?

Фото профиля Tritonix
Tritonix2 лет назад

CLIP bridges the gap between words & pictures. How could this shape the way we interact with machines in the future? #CLIP #AI #FutureTech

Фото профиля Master Of Code Gl.
Master Of Code Gl.2 лет назад

Impressive breakdown of the CLIP model. Your work at the intersection of computer vision and NLP is commendable. At Master of Code, we excel in applying advanced AI like CLIP to revolutionize user interactions.

Фото профиля Maxin 🇬🇭
Maxin 🇬🇭2 лет назад

@threadreaderapp unroll

Фото профиля Thread Reader App
Thread Reader App2 лет назад

@ProfTomYeh @Maxin_check Hi! the unroll you asked for: Enjoy :) 🤖

Фото профиля @investornelson❤️XALLY🐬$BLUAI $BPAD Runes🔶⛺
@investornelson❤️XALLY🐬$BLUAI $BPAD Runes🔶⛺2 лет назад

@IntentAGI $OO @moonberg_ai $ZENT @ZentryHQ $MINE @MineProBusiness

Похожие видео

[Self-Attention] by Hand ✍️ Self-attention is what enables LLMs to understand context. How does it work? This exercise demonstrates how to calculate a 6-3 attention head by hand. Note that if we have two instances of this, we get 6-6 attention (i.e., multi-head attention, n=2). -- 𝗚𝗼𝗮𝗹 -- Transform [6D Features 🟧] to [3D Attention Weighted Features 🟦] -- 𝗪𝗮𝗹𝗸𝘁𝗵𝗿𝗼𝘂𝗴𝗵 -- [1] Given ↳ A set of 4 feature vectors (6-D): x1,x2,x3,x4 [2] Query, Key, Value ↳ Multiply features x's with linear transformation matrices WQ, WK, and WV, to obtain query vectors (q1,q2,q3,q4), key vectors (k1,k2,k3,k4), and value vectors (v1,v2,v3,v4). ↳ "Self" refers to the fact that both queries and keys are derived from the same set of features. [3] 🟪 Prepare for MatMul ↳ Copy query vectors ↳ Copy the transpose of key vectors [4] 🟪 MatMul ↳ Multiply K^T and Q ↳ This is equivalent to taking dot product between every pair of query and key vectors. ↳ The purpose is to use dot product as an estimate of the "matching score" between every key-value pair. ↳ This estimate makes sense because dot product is the numerator of Cosine Similarity between two vectors. [5] 🟨 Scale ↳ Scale each element by the square root of dk, which is the dimension of key vectors (dk=3). ↳ The purpose is to normalize the impact of the dk on matching scores, even if we scale dk to 32, 64, or 128. ↳ To simplify hand calculation, we approximate [ □/sqrt(3) ] with [ floor(□/2) ]. [6] 🟩 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [7] 🟩 Softmax: ∑ ↳ Sum across each column [8] 🟩 Softmax: 1 / sum ↳ For each column, divide each element by the column sum ↳ The purpose is normalize each column so that the numbers sum to 1. In other words, each column is a probability distribution of attention, and we have four of them. ↳ The result is the Attention Weight Matrix (A) (yellow) [9] 🟦 MatMul ↳ Multiply the value vectors (Vs) with the Attention Weight Matrix (A) ↳ The results are the attention weighted features Zs. ↳ They are fed to the position-wise feed forward network in the next layer.

Tom Yeh

101,010 просмотров • 2 лет назад

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 просмотров • 9 месяцев назад

SORA by Hand ✍️ OpenAI’s #SORA took over the Internet when it was announced earlier this year. The technology behind Sora is the Diffusion Transformer (DiT) developed by William Peebles and Shining Xie. How does DiT work? 𝗚𝗼𝗮𝗹: Generate a video conditioned by a text prompt and a series of diffusion steps [1] Given ↳ Video ↳ Prompt: "sora is sky" ↳ Diffusion step: t = 3 [2] Video → Patches ↳ Divide all pixels in all frames into 4 spacetime patches [3] Visual Encoder: Pixels 🟨 → Latent 🟩 ↳ Multiply the patches with weights and biases, followed by ReLU ↳ The result is a latent feature vector per patch ↳ The purpose is dimension reduction from 4 (2x2x1) to 2 (2x1). ↳ In the paper, the reduction is 196,608 (256x256x3)→ 4096 (32x32x4) [4] ⬛ Add Noise ↳ Sample a noise according to the diffusion time step t. Typically, the larger the t, the smaller the noise. ↳ Add the Sampled Noise to latent features to obtain Noised Latent. ↳ The goal is to purposely add noise to a video and ask the model to guess what that noise is. ↳ This is analogous to training a language model by purposely deleting a word in a sentence and ask the model to guess what the deleted word was. [5-7] 🟪 Conditioning by Adaptive Layer Norm [5] Encode Conditions ↳ Encode "sora is sky" into a text embedding vector [0,1,-1]. ↳ Encode t = 3 to as a binary vector [1,1]. ↳ Concatenate the two vectors in to a 5D column vector. [6] Estimate Scale/Shift ↳ Multiply the combined vector with weights and biases ↳ The goal is to estimate the scale [2,-1] and shift [-1,5]. ↳ Copy the result to (X) and (+) [7] Apply Scale/Sift ↳ Scale the noised latent by [2,-1] ↳ Shifted the scaled noised latent by [-1, 5] ↳ The result is "conditioned" noise latent. [8-10] Transformer [8] Self-Attention ↳ Feed the conditioned noised latent to Query-Key function to obtain a self-attention matrix ↳ Value is omitted for simplicity [9] Attention Pooling ↳ Multiply the conditioned noised latent with the self-attention matrix ↳ The result are attention weighted features [10] Pointwise Feed Forward Network ↳ Multiply the attention weighted features with weights and biases ↳ The result is the Predicted Noise 🏋️‍♂️ 𝗧𝗿𝗮𝗶𝗻 [11] ↳ Calculate MSE loss gradients by taking the different between the Predicted Noise and the Sampled Noise (ground truth). ↳ Use the loss gradients to kick off backpropagation to update all learnable parameters (red borders) ↳ Note the visual encoder and decoder's parameters are frozen (blue borders) 🎨 𝗚𝗲𝗻𝗲𝗿𝗮𝘁𝗲 (𝗦𝗮𝗺𝗽𝗹𝗲) [12] Denoise ↳ Subtract the predicted noise from the noised latent to obtain the noise-free latent [13] Visual Decoder: Latent 🟩 → Pixels 🟨 ↳ Multiply the patches with weights and biases, followed by ReLU [14] Patches → Video ↳ Rearrange patches into a sequence of video frames.

Tom Yeh

238,097 просмотров • 2 лет назад

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,572 просмотров • 9 месяцев назад