jina-embeddings-v5-omni is here! Our first universal embedding model for text, images, audio, and video. Available in two sizes: small (1.57B, 1024-dim, 32K context) and nano (0.95B, 768-dim, 8K context). Both support Matryoshka truncation down to 32 dimensions. v5-omni is back-compatible: if you already use jina-embeddings-v5-text-small/nano, the existing text indexes...

Jina AI

17,310 subscribers

134,646 views • 2 months ago • via X (Twitter)

Science & Technology
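
The Jina AI post above mentions Matryoshka truncation down to 32 dimensions. As a minimal sketch of what that usually means in practice: keep the leading components of the vector, then re-normalize so cosine or dot-product comparisons stay meaningful. The function names and the stand-in vector below are hypothetical and are not part of any Jina API; a real vector would come from the model.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` components and L2-normalize the result."""
    if not 1 <= dims <= len(vector):
        raise ValueError("dims must be between 1 and len(vector)")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to normalize
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


# Hypothetical 1024-dim vector standing in for a real model output.
full = [math.sin(i * 0.37) for i in range(1024)]

# Truncate to the smallest size the post mentions.
small = truncate_embedding(full, 32)
print(len(small))
```

Shorter vectors reduce index size and search cost, usually at some cost in retrieval quality, so the truncation length is worth evaluating on your own data.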

Related Videos

🩷 Nano Banana 2 Lite is my go-to now for image generation and editing! Very, very high quality, at only a small fraction of the price of our other 🍌s. ✨Bonus: Omni is now available via the API! Go forth and automate creating and editing even videos with natural language, not just images.

👩‍💻 Paige Bailey

11,273 views • 1 month ago

We’re shipping two major updates to streamline your creative workflow, allowing you to generate high-speed images with one model and then instantly animate them with the other—all at a fraction of the cost 🍌⚡️ 1️⃣ Introducing Nano Banana 2 Lite: Our fastest and most cost-efficient Gemini Image model yet delivers text-to-image outputs in under 4 seconds. Now available via the Gemini API and Google AI Studio, and rolling out soon across @NotebookLM, Google Flow, Google Gemini, Stitch by Google, Google Search and Google Photos. 2️⃣ Gemini Omni Flash in Public Preview: Our natively multimodal model for cost-efficient video generation and conversational editing. Now available via the Gemini API, Google AI Studio, and Gemini Enterprise Agent Platform so you can integrate the model into your workflow. While exciting on their own, the real magic happens when you build using these models together. Watch how our interior design demo integrates Nano Banana 2 Lite and Omni to instantly reimagine any space. Upload a photo, swipe through tailored design concepts, and see Omni bring the details to life in cinematic motion. Try out the demo app in AI Studio:

Google AI

124,194 views • 1 month ago

New short course Multimodal RAG: Chat with Videos, developed with Intel and taught by vasudevlal!

In this course, you’ll work with LLaVA (Large Language and Vision Assistant), a Large Vision Language Model (LVLM) that can process both images and text. For example, given an image of a person doing a handstand on a skateboard at the beach, LLaVA doesn't just caption the scene, it’s able to predict possible outcomes, like the person losing balance or falling off. By understanding not just what's in a video frame, but what might happen next, your application can provide more insightful answers to questions about video.

You'll build a full multimodal RAG pipeline that can chat about video content:
- Use the BridgeTower model to create joint text-image embeddings in a 512-dimensional multimodal semantic space.
- Learn video processing techniques to extract keyframes, generate transcripts using Whisper, and create captions.
- Use the LanceDB vector database to store and retrieve high-dimensional multimodal embeddings.
- Integrate the LLaVA model, combining CLIP's (Contrastive Language Image Pretraining) vision transformer with Llama, for advanced visual-textual reasoning.

Your final system will ingest video data, generate embeddings for frames and text, perform similarity searches for relevant content, and use the retrieved multimodal context to inform LVLM-based response generation. The result is a system capable of answering nuanced questions about video content, effectively chatting about the video it has processed. Please sign up here!

Andrew Ng

107,825 views • 1 year ago
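
The retrieval step described in the course post above (embed frames and text, then run a similarity search for relevant content) can be sketched roughly as follows. This is an illustration under stated assumptions: the tiny 3-dimensional vectors stand in for real 512-dimensional embeddings, the in-memory dictionary stands in for a vector database, and none of the names below belong to the BridgeTower or LanceDB APIs.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def top_k(query, store, k=2):
    """Return the k (score, item_id) pairs most similar to the query."""
    scored = [(cosine(query, vec), item_id) for item_id, vec in store.items()]
    scored.sort(reverse=True)  # highest similarity first
    return scored[:k]


# Hypothetical toy embeddings for two video frames and one transcript chunk.
store = {
    "frame_012": [0.9, 0.1, 0.0],
    "frame_240": [0.1, 0.9, 0.1],
    "transcript_03": [0.8, 0.2, 0.1],
}

# Hypothetical embedding of the user's question.
query = [1.0, 0.0, 0.0]

hits = top_k(query, store, k=2)
context_ids = [item_id for _, item_id in hits]
print(context_ids)
```

In a full pipeline, the frames and transcript chunks behind `context_ids` would be passed to the vision-language model as context for generating the answer.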

We asked Logan Kilpatrick what stood out most from Google I/O. His answer: Gemini Omni, Google's new multimodal AI model. "The model can take in any input and produce any output. Text, audio, video, image." "You get a bunch of really interesting capability transfer when you bring it all into a single model." Right now the killer use case is video editing. "It's like having a VFX studio on demand."

MTS

18,884 views • 2 months ago

The possibilities are endless with Dreamina AI Seedance 2.0! I’ve been testing its omni-features, and it’s seriously impressive. It can handle multiple inputs at once (images, video, and audio) and still understand the full context perfectly. The biggest breakthrough? Seamless character swapping in existing videos with high precision. This is a huge step forward for creators. And the good news: Dreamina Seedance 2.0 Fast is available as a free trial to all users. What do you think? #dreamina #dreaminapartner #seedance #dreaminaseedance2 #seedream

Dinda Prasetyo

10,524 views • 4 months ago

Here is my ultimate guide for using Seedance 2.0 Omni, the most powerful feature for AI video creation. Tutorial linked in the second post. It's an absolute powerhouse for AI filmmaking. I'm sharing a whole bunch of useful tips and tricks here: consistency for AI filmmaking, consistent voices, music video creation, creating advertisements, using the power of video extension, inpainting tips and tricks, etc. If you value consistency, you should 100% be using the Omni feature. It lets you reference images, videos, and even audio to create scenes. I'll showcase various ways to unlock its true potential. Once you try Omni you'll be asking yourself why you didn't start using it earlier. It's really amazing! This video is jam-packed with value. Enjoy!

Travis Davids

14,351 views • 3 months ago

If you're building a PDF RAG pipeline: Should you be using OCR and text-based retrieval methods, or just embed images directly using late interaction models? This paper says the answer might actually be both.

My colleagues at Weaviate released IRPAPERS, a benchmark comparing image-based and text-based retrieval over 3,230 pages from 166 scientific papers.

The setup: Take the same PDFs and process them two ways. For text, run OCR with GPT-4.1 and embed with Arctic 2.0 + BM25 hybrid search. For images, embed raw page images with ColModernVBERT multi-vector embeddings. Test both on 180 needle-in-the-haystack questions.

The results:
Text edges out images at the top rank: 46% vs 43% Recall@1
But images match or exceed text at deeper recall: 93% vs 91% Recall@20

But text and image based methods actually fail on different queries. At Recall@1:
• 22 queries succeed with text but fail with images
• 18 queries succeed with images but fail with text

This complementarity is what makes Multimodal Hybrid Search work. By fusing scores from both text and image retrieval, they achieved:
• 49% Recall@1 (beating either modality alone)
• 81% Recall@5
• 95% Recall@20

More in the video below 🔽 Dataset: Paper: Code:

Victoria Slocum

43,996 views • 4 months ago
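
The IRPAPERS post above says that scores from text and image retrieval are fused, but it does not say how. The sketch below assumes one common choice, min-max normalization followed by a weighted sum, and also shows how Recall@k is computed. The fusion method, the `alpha` weight, and all scores are assumptions made for illustration; the benchmark's actual procedure may differ.

```python
def min_max(scores):
    """Rescale a {doc_id: score} map to the [0, 1] range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(text_scores, image_scores, alpha=0.5):
    """Rank docs by a weighted sum of normalized scores (missing = 0)."""
    t, i = min_max(text_scores), min_max(image_scores)
    docs = set(t) | set(i)
    fused = {d: alpha * t.get(d, 0.0) + (1 - alpha) * i.get(d, 0.0) for d in docs}
    # Sort by descending fused score, breaking ties by doc id.
    return sorted(fused, key=lambda d: (-fused[d], d))


def recall_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold page appears in the top k results."""
    hits = sum(1 for q, ranked in ranked_lists.items() if gold[q] in ranked[:k])
    return hits / len(ranked_lists)


# Hypothetical scores for one query over three pages; the gold page is "p2".
text_scores = {"p1": 12.0, "p2": 11.5, "p3": 2.0}     # text ranks p1 first
image_scores = {"p1": 0.30, "p2": 0.80, "p3": 0.35}   # image ranks p2 first

ranking = fuse(text_scores, image_scores)
print(ranking)
print(recall_at_k({"q1": ranking}, {"q1": "p2"}, k=1))
```

In this toy case text retrieval alone misses the gold page at rank 1, while the fused ranking recovers it, which is the kind of complementarity the post describes.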

Rolling out today we are launching Nano Banana Pro, the world’s best image model built to move beyond casual creation and into a new era of studio-quality, functional design. Nano Banana Pro enables a new level of precision and creative control, transforming the way you bring ideas to life. Here are a couple of our favorite new features: — Text rendering and translation: Generate crystal-clear text directly within your images. With the model’s advanced language understanding, you can even translate and regenerate visuals with localized text. — World knowledge: By connecting to Search’s vast knowledge base, Nano Banana Pro generates factually accurate diagrams and realistic product placements, making it an invaluable tool for learning and communication.

Google AI

337,716 views • 8 months ago

✨ I made my first video game with ChatGPT:
1) ChatGPT generates a text-based adventure game with DALL-E 3 generating images for it
2) Every time you play, the game is different because it generates the story and images live
3) The images from DALL-E are sent to Runway, which turns images into video
4) The text is sent to ElevenLabs, which turns the text adventure into a pirate narrator voice
5) It's merged into a video
6) Interactive buttons are overlaid

The game is called: 🐒🏝️🇳🇱The Secret of Monkey Island: Amsterdam (unofficial)

And you can play it here: (video + TTS + buttons don't work automatically yet, so that part is manual for now, but text + img works; I'm building an interface for it now)

@levelsio

2,725,011 views • 2 years ago

With the latest Seedance 2.0 release, there’s a feature we think might be even more transformative than the base video model itself: Seedance Omni. Similar to Kling Omni, Luma Modify, and Runway Aleph, Seedance Omni lets you guide the AI and make targeted edits to an existing video clip. It supports up to 9 reference images, 3 video clips, and 3 audio clips, allowing it to synthesize multiple layers of creative direction. We tested it across a range of scenarios (full prompts in the 🧵) 1. Modify eye color 2. Change weather 3. Time travel effect 4. Character swap 5. Add a spaceship 6. Change asteroids to meatballs 7. Dragon emerging from the clouds Verdict🏆: Seedance Omni excels at physical video dynamics, large visual effects, and environmental transformations. Its main weakness is resolution and output quality (around 720p), which can introduce flickering and softness.

Curious Refuge

53,283 views • 5 months ago

The QVAC SDK puts the "brain" directly into your pocket. From real-time on-device translation to multimodal understanding, build apps that work everywhere, even 30,000 feet in the air. Local AI is here: 💡Offline-First: No cloud, no latency, no "Department of Truth". 💻 Universal API: One codebase for iOS, Android, macOS, and Linux. 🔍 Multimodal: Understanding text, audio, and images without a server. If you can dream it, you can build it. The era of Stable Intelligence has begun. Start building:

QVAC

36,457 views • 3 months ago

Gemini 3.1 Flash-Lite is the model for building always-on multimodal AI agents. I just built a memory agent with Google ADK that runs 24/7. Feed it text, images, audio, video, or PDFs. It reads, connects, and consolidates everything while you sleep. 100% open-source code.

Shubham Saboo

24,930 views • 5 months ago

We partnered with artists, designers, and builders to create new AI tools that solve real problems in their creative workflows. Here’s what’s new: — Introducing Google Pics in Google Workspace: A brand-new image creation & editing tool. Move and resize objects, add text, and translate just by hovering and clicking — Big updates to Google Flow: 1) You can now create with Gemini Omni Flash in Google Flow 2) Google Flow Agent is a multi-step creative partner that reasons and plans complex tasks with you. 3) Google Flow tools are custom tools you can “vibe code” for animations, video effects, text layering & more — Design live with Stitch by Google: Now, you can use text or voice prompts to edit layouts in real time then export those designs straight to code — More creative control in Google Flow Music: Edit songs section by section, remix the style of full songs, and create music videos with our new Gemini Omni Flash model

Google AI

13,953,239 views • 2 months ago

Wan2.5: Let Sound Take the Director’s Chair! 🎬 Today, we’re excited to unveil another major feature in our powerful Wan 2.5 Preview: Native Audio-Driven Video Generation. ✨ Now you can use audio input directly for both text-to-video and image-to-video generation. Combine audio with text prompts or a reference image to shape your video's narrative. ✨ With support for videos up to 10 seconds and enhanced video quality, unlock a richer visual space where more engaging stories come to life.

Wan

52,119 views • 10 months ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨

✨ New in 2.1:
🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image.
🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields.
🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality.
🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image.

HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation.

Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 10 months ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 2 years ago

The biggest mistake people make with Seedance 2.0 is writing prompts at all. Sounds strange, but the model wasn't built for describing things in words - it was built for multimodal direction: up to 12 references at once, combining images, video, and audio. Each reference type controls a different layer: the image sets the style, the video defines the camera movement, the audio sets the rhythm of the scene. When you combine all three instead of typing "camera slowly pushes in, tense atmosphere" - the model understands it directly, with no interpretation and no guessing involved. Text is the weakest control tool available here. And most users are stuck using exactly that.

Zentrix⌚️

72,877 views • 14 days ago

Text-to-speech is moving way too fast. Just a few days ago, I tweeted about PersonaPlex-7B, NVIDIA's new open source TTS. And today, Qwen just open-sourced Qwen3-TTS 🤯 It’s a revolutionary text-to-speech model built for control. Not just about generating speech, but about shaping how it sounds directly from language. You can guide the pace, the tone, and the expressiveness straight from text, without touching audio graphs or hand-tuning parameters. That’s the real shift!

What makes Qwen3-TTS stand out is how practical it already is:
→ voice cloning from just a few seconds of audio
→ voice creation without any reference sample
→ support for 10 languages out of the box
→ end-to-end latency down to ~97ms
→ works in both streaming and non-streaming setups

The models come in two sizes (0.6B and 1.7B), so you can trade off quality and hardware cost depending on your setup. You can work with curated voices, designed voices, or cloned ones, and it integrates cleanly with vLLM for production use. It also ships as a simple Python package you can pip install. If you’re building real-time voice systems, this removes a lot of friction! 100% free and open source. I put the repo in the 🧵↓

Charly Wargnier

59,144 views • 6 months ago

NVIDIA JUST DROPPED A FREE AI MODEL THAT READS PDFS, WATCHES VIDEOS, LISTENS TO AUDIO, AND UNDERSTANDS YOUR SCREEN SIMULTANEOUSLY. Not one at a time. ALL AT ONCE. In a single pass. It is called Nemotron 3 Nano Omni and it runs 9 times faster than every other multimodal model currently available. Think about what that actually means for how you work. Right now you are switching between tools constantly. One tool for transcribing your call recordings. A different tool for analyzing your client PDFs. Another tool for processing your training videos. A separate workflow for understanding what is happening on your screen. Four tools. Four contexts. Four different outputs you have to manually synthesize into one decision. Nemotron 3 Nano Omni does all of it in one model. One pass. One output. The use cases that just got dramatically simpler: Meeting recordings where you need the transcript, the visual context, and the document references all analyzed together. Training videos where the audio, the slides, and the on-screen demonstrations all feed into one coherent summary. Client PDFs where you need the document content cross-referenced against your screen data and your call notes simultaneously. Sales call transcripts analyzed alongside the proposals and the CRM data in one unified pass. This is not a marginal improvement on existing multimodal models. It is a 9x speed increase on a capability that was already changing how people work. Free. From NVIDIA. Available right now. Bookmark this before everyone catches on. Follow CyrilXBT for every AI capability shift the moment it drops.

CyrilXBT

37,847 views • 3 months ago