Video yükleniyor...

Video Yüklenemedi

Bu video yüklenirken bir sorun oluştu. Bu geçici bir ağ sorunundan kaynaklanıyor olabilir veya video kullanılamıyor olabilir.

Ana Sayfaya Dön

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision: - a high quality, unified embedding space for image, text, and multimodal tasks - outperforms both OpenAI CLIP and text-embedding-3-small - open weights and code to enable indie hacking, research, and experimentation - released in collaboration with MongoDB, LlamaIndex 🦙, ,... show more

CalCo

20,583 subscribers

103,205 görüntüleme • 2 yıl önce •via X (Twitter)

Bilim & Teknoloji

Anya Rossi• Live Now

Private livecam show

11 Yorum

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

Existing text-image embedding models, including OpenAI’s CLIP, dramatically underperform specialized text encoders on text retrieval tasks. This forces developers to deploy several embedding models and store several vector indices for multimodal applications. With Nomic-Embed-Vision, developers can use a single vector space to power both their text-text and text-image retrieval tasks.

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

We’ve been honored by the reception of Nomic-Embed-Text, which has grown into one of the most downloaded models on @huggingface. We designed Nomic-Embed-Vision to be compatible with Nomic-Embed-Text out of the box, making it easy for developers using Nomic-Embed-Text to extend their applications with multimodal features. Put simply, any vector created using Nomic-Embed-Text can be used to query vectors created by Nomic-Embed-Vision, and vice versa.

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

We are releasing Nomic-Embed-Vision under a CC-BY-NC-4.0 license. This will enable researchers and hackers to continue experimenting with our models, as well as enable Nomic to continue releasing great models in the future. As Nomic releases future models, we intend to apply Apache-2.0 licenses to the less recent models in our catalogue. You can download the model on @huggingface here!

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

We also worked with @LangChainAI and @llama_index to ensure day 1 compatibility with the model orchestration frameworks developers love:

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

If you want to use Nomic-Embed-Vision or Nomic-Embed-Text in production, we recommend using our @awscloud marketplace offering, and storing the vectors in a @MongoDB Atlas vector store:

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

You can also access the model through our python client and in our Nomic Embedding API.

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

To learn more about how we built Nomic-Embed-Vision, check out our blog post, and keep an eye out for our forthcoming technical report:

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

Nomic Embed was trained on @digitalocean compute, with early experiments made possible by a generous compute grant from @LambdaAPI.

andrew gao profil fotoğrafı

andrew gao2 yıl önce

I got early access and built this using Nomic-Embed-Vision! Check out a museum collection of 250,000 works of art!

txh 📟 profil fotoğrafı

txh 📟2 yıl önce

@Teknium1 this is really cool, what can I use to visualize/plot the cluster after computing the embeddings?

Nomic AI profil fotoğrafı

Nomic AI2 yıl önce

@Teknium1 You can use

Benzer Videolar

Llama 3.2 features 11B & 90B models, our first multimodal Llama models with support for vision tasks. These models can take in both image and text prompts to deeply understand and reason on inputs.

Llama 3.2 features 11B & 90B models, our first multimodal Llama models with support for vision tasks. These models can take in both image and text prompts to deeply understand and reason on inputs.

AI at Meta

121,530 görüntüleme • 1 yıl önce

jina-embeddings-v5-omni is here! Our first universal embedding model for text, images, audio, and video. Available in two sizes: small (1.57B, 1024-dim, 32K context) and nano (0.95B, 768-dim, 8K context). Both support Matryoshka truncation down to 32 dimensions. v5-omni is back-compatible: if you already use jina-embeddings-v5-text-small/nano, the existing text indexes work with v5-omni out of the box. Without reindexing the text, just index your multimodal content with v5-omni and start searching images, audio, and video.

jina-embeddings-v5-omni is here! Our first universal embedding model for text, images, audio, and video. Available in two sizes: small (1.57B, 1024-dim, 32K context) and nano (0.95B, 768-dim, 8K context). Both support Matryoshka truncation down to 32 dimensions. v5-omni is back-compatible: if you already use jina-embeddings-v5-text-small/nano, the existing text indexes work with v5-omni out of the box. Without reindexing the text, just index your multimodal content with v5-omni and start searching images, audio, and video.

Jina AI

134,015 görüntüleme • 2 ay önce

StarVector is out on Hugging Face StarVector is a foundation model for generating Scalable Vector Graphics (SVG) code from images and text. It utilizes a Vision-Language Modeling architecture to understand both visual and textual inputs, enabling high-quality vectorization and text-guided SVG creation.

StarVector is out on Hugging Face StarVector is a foundation model for generating Scalable Vector Graphics (SVG) code from images and text. It utilizes a Vision-Language Modeling architecture to understand both visual and textual inputs, enabling high-quality vectorization and text-guided SVG creation.

AK

254,342 görüntüleme • 1 yıl önce

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

Tom Yeh

67,834 görüntüleme • 2 yıl önce

New short course: Building Multimodal Search and RAG", by Weaviate AI Database's Sebastia(N_) Witalec ✊🏽✊🏾✊🏿. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

New short course: Building Multimodal Search and RAG", by Weaviate AI Database's Sebastia(N_) Witalec ✊🏽✊🏾✊🏿. Contrastive learning is used to train models to map vectors into an embedding space by pulling similar concepts closer together and pushing dissimilar concepts away from each other. This technique is also used to train multimodal embedding models that capture semantic similarity across different modalities like text, images, and audio. These multimodal embeddings can be used to build multimodal search and RAG systems. In this course, you'll learn how contrastive learning works, and how to add multimodality to RAG – so your models can draw on diverse, relevant context to answer questions. For example, a query about a financial report might synthesize information from text snippets, graphs, tables, and slides. You will also learn how visual instruction tuning lets you integrate image understanding into language models, and build a multi-vector recommender system using Weaviate’s open-source vector database. Please sign up here:

Andrew Ng

104,371 görüntüleme • 2 yıl önce

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 görüntüleme • 10 ay önce

We’re expanding the Gemini API File Search tool 🔍 with 3 new updates that enable developers to more easily build multimodal RAG systems with enhanced precision: + Multimodal Support: By leveraging our Gemini Embedding 2 model, File Search can now reason across image and text simultaneously. + Custom Metadata Filtering: Bring structure to unstructured data by tagging files with custom key-value labels. This pre-filters your data and boosts search speed. + Exact citations: File Search can now capture and return the exact source (down to the page number) for every piece of information indexed. See multimodal File Search in action with our example app in Google AI Studio. Chat with your entire image and doc library, ask questions, and trace answers back to the source:

We’re expanding the Gemini API File Search tool 🔍 with 3 new updates that enable developers to more easily build multimodal RAG systems with enhanced precision: + Multimodal Support: By leveraging our Gemini Embedding 2 model, File Search can now reason across image and text simultaneously. + Custom Metadata Filtering: Bring structure to unstructured data by tagging files with custom key-value labels. This pre-filters your data and boosts search speed. + Exact citations: File Search can now capture and return the exact source (down to the page number) for every piece of information indexed. See multimodal File Search in action with our example app in Google AI Studio. Chat with your entire image and doc library, ask questions, and trace answers back to the source:

Google AI Developers

108,622 görüntüleme • 2 ay önce

HunyuanImage 3.0 is officially here. It's the first open-source model with industrial-grade native multimodal architecture and delivers instant, high-quality visual content for every designer and content creator. Watch this quick showcase to see its performance on generating intricate text, detailed comics, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

HunyuanImage 3.0 is officially here. It's the first open-source model with industrial-grade native multimodal architecture and delivers instant, high-quality visual content for every designer and content creator. Watch this quick showcase to see its performance on generating intricate text, detailed comics, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

31,048 görüntüleme • 9 ay önce

Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR to techniques to process speech before synthesizing with an LLM to generate text — but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

Today we released Meta Spirit LM — our first open source multimodal language model that freely mixes text and speech. Many existing AI voice experiences today use ASR to techniques to process speech before synthesizing with an LLM to generate text — but these approaches compromise the expressive aspects of speech. Using phonetic, pitch and tone tokens, Spirit LM models can overcome these limitations for both inputs and outputs to generate more natural sounding speech while also learning new tasks across ASR, TTS and speech classification. We hope that sharing this work will enable the research community to further new approaches for text and speech integration.

AI at Meta

351,739 görüntüleme • 1 yıl önce

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs. • VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture. • We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction. • We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding. • We highlight that our VL-JEPA model, with an unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time. by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fung

Introducing VL-JEPA: Vision-Language Joint Embedding Predictive Architecture for streaming, live action recognition, retrieval, VQA, and classification tasks with better performance and higher efficiency than large VLMs. • VL-JEPA is the first non-generative model that can perform general-domain vision-language tasks in real-time, built on a joint embedding predictive architecture. • We demonstrate in controlled experiments that VL-JEPA, trained with latent space embedding prediction, outperforms VLMs that rely on data space token prediction. • We show that VL-JEPA delivers significant efficiency gains over VLMs for online video streaming applications, thanks to its non-autoregressive design and native support for selective decoding. • We highlight that our VL-JEPA model, with an unified model architecture, can effectively handle a wide range of classification, retrieval, and VQA tasks at the same time. by Delong Chen (陈德龙) Mustafa Shukor Théo Moutakanni Willy Jade Lei Yu Tejaswi Kasarla Allen Bolourchi Yann LeCun Pascale Fung

Pascale Fung

90,144 görüntüleme • 7 ay önce

Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs. • Run models locally on Mac, Windows, Linux • Train 500+ models 2x faster with 70% less VRAM • Supports GGUF, vision, audio, embedding models • Auto-create datasets from PDF, CSV, DOCX • Self-healing tool calling and code execution • Compare models side by side + export to GGUF GitHub: Blog and Guide: Available now on Hugging Face, NVIDIA, Docker and Colab.

Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs. • Run models locally on Mac, Windows, Linux • Train 500+ models 2x faster with 70% less VRAM • Supports GGUF, vision, audio, embedding models • Auto-create datasets from PDF, CSV, DOCX • Self-healing tool calling and code execution • Compare models side by side + export to GGUF GitHub: Blog and Guide: Available now on Hugging Face, NVIDIA, Docker and Colab.

Unsloth AI

1,703,221 görüntüleme • 4 ay önce

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 görüntüleme • 3 yıl önce

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,658 görüntüleme • 9 ay önce

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content: - Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space. - Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions. - Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings. - Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning. Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal! In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video. You'll build a full multimodal RAG pipeline that can chat about video content: - Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space. - Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions. - Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings. - Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning. Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,825 görüntüleme • 1 yıl önce

Introducing SDXL Turbo: A real-time text-to-image generation model. SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. The code, research paper, and weights for non-commercial use are now available on our website. You can test SDXL Turbo on Stability AI’s image editing platform Clipdrop, with a beta demonstration of the real-time text-to-image generation capabilities. Learn more:

Introducing SDXL Turbo: A real-time text-to-image generation model. SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. The code, research paper, and weights for non-commercial use are now available on our website. You can test SDXL Turbo on Stability AI’s image editing platform Clipdrop, with a beta demonstration of the real-time text-to-image generation capabilities. Learn more:

Stability AI

976,344 görüntüleme • 2 yıl önce

⚡️We are excited to announce that our new no-code Enterprise Platform is NOW available in private beta! As RAG apps advance from prototype to production we’ve been overwhelmed by requests for an enterprise grade solution to provide these applications with the data they need. Designed to make it easy to get your data #RAGready, our Platform can preprocess more than 25 file types and soon will be fully #multimodal, also able to ingest audio, video and image files. We ship with a baseline suite of source connectors, including Amazon Web Services S3, Microsoft Azure Blob Storage, OneDrive, SFTP, Databricks Delta Table, Google Drive, Salesforce, Elastic, OpenSearch, and Google Cloud storage with many more fast following. Platform transforms your documents into a standardized JSON schema, broken down into semantically coherent elements allowing you to reconstruct your document in the manner most useful to you. Want only the narrative text but not the headers and footers? This is entirely configurable through the UI. Additionally, we generate more than 30 types of metadata for each element to make it easy to curate the data being written downstream and to support metadata filtering during retrieval. Smart chunking and the ability to choose from a range of embedding models are in from launch, delivering a turnkey solution for chunk and embedding experimentation. As for destination connectors, we've got that covered too, with Amazon Web Services S3, Pinecone, Chroma , Weaviate AI Database, Google Cloud storage, MongoDB, Microsoft Azure cognitive search, PostgreSQL, Elastic, OpenSearch, and Databricks Delta Table. And of course, all of this can be scheduled to keep your data continuously hydrated. The private-beta is live today! Sign-up to get access and come build the future of LLM data foundations with us: 🚀 #ETLforLLMs #AI #DataPreprocessing #DataScience #DataTransformation #LLMs #ETL #ML #PreppingData #MachineLearning #RAG #Engineer #Unstructured #Unstructuredio #RetrievalAugmentedGeneration #multimodal #AIJobs

⚡️We are excited to announce that our new no-code Enterprise Platform is NOW available in private beta! As RAG apps advance from prototype to production we’ve been overwhelmed by requests for an enterprise grade solution to provide these applications with the data they need. Designed to make it easy to get your data #RAGready, our Platform can preprocess more than 25 file types and soon will be fully #multimodal, also able to ingest audio, video and image files. We ship with a baseline suite of source connectors, including Amazon Web Services S3, Microsoft Azure Blob Storage, OneDrive, SFTP, Databricks Delta Table, Google Drive, Salesforce, Elastic, OpenSearch, and Google Cloud storage with many more fast following. Platform transforms your documents into a standardized JSON schema, broken down into semantically coherent elements allowing you to reconstruct your document in the manner most useful to you. Want only the narrative text but not the headers and footers? This is entirely configurable through the UI. Additionally, we generate more than 30 types of metadata for each element to make it easy to curate the data being written downstream and to support metadata filtering during retrieval. Smart chunking and the ability to choose from a range of embedding models are in from launch, delivering a turnkey solution for chunk and embedding experimentation. As for destination connectors, we've got that covered too, with Amazon Web Services S3, Pinecone, Chroma , Weaviate AI Database, Google Cloud storage, MongoDB, Microsoft Azure cognitive search, PostgreSQL, Elastic, OpenSearch, and Databricks Delta Table. And of course, all of this can be scheduled to keep your data continuously hydrated. The private-beta is live today! Sign-up to get access and come build the future of LLM data foundations with us: 🚀 #ETLforLLMs #AI #DataPreprocessing #DataScience #DataTransformation #LLMs #ETL #ML #PreppingData #MachineLearning #RAG #Engineer #Unstructured #Unstructuredio #RetrievalAugmentedGeneration #multimodal #AIJobs

Unstructured

21,874 görüntüleme • 2 yıl önce

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 görüntüleme • 1 yıl önce

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and producing professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Today we're announcing the open-source release of HunyuanVideo-Foley, our new end-to-end Text-Video-to-Audio (TV2A) framework for generating high-fidelity audio.🚀 This tool empowers creators in video production, filmmaking, and game development to generate professional-grade audio that precisely aligns with visual dynamics and semantic context, addressing key challenges in V2A generation.🔊 Key Innovations: 🔹Exceptional Generalization: Trained on a massive 100k-hour multimodal dataset, the model generates contextually-aware soundscapes for a wide range of scenes, from natural landscapes to animated shorts. 🔹Balanced Multimodal Response: Our innovative multimodal diffusion transformer (MMDiT) architecture ensures the model balances video and text cues, generating rich, layered sound effects that capture every detail—from the main subject to subtle background elements. 🔹High-Fidelity Audio: Using a Representation Alignment (REPA) loss function and a powerful Audio VAE, we've improved generation stability and producing professional-grade audio, free of noise and inconsistencies. HunyuanVideo-Foley achieves SOTA on multiple benchmarks, surpassing all open-source models in audio quality, visual-semantic alignment, and temporal alignment. 👉Try it now: 🌐Project Page: 🔗Code: 📄Technical Report: 🤗Hugging Face:

Tencent Hy

122,706 görüntüleme • 10 ay önce

InternLM-XComposer-2.5 A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

InternLM-XComposer-2.5 A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

AK

66,472 görüntüleme • 2 yıl önce