GLM-4.6V can accept multimodal inputs of various types and automatically generate high-quality, structured image-text interleaved content.

Z.ai

26,873 subscribers

13,655 views • 6 months ago • via X (Twitter)

Comments: 0

Related videos

Let's go hands-on with #GeminiAI. Our newest AI model can reason across different types of inputs and outputs — like images and text. See Gemini's multimodal reasoning capabilities in action ↓

Google

1,005,799 views • 2 years ago

3D AI is leveling up! Rodin 3D AI can create stunning, high-quality 3D models from just text or image inputs. And with its latest update, it can even generate 8K HDRI textures to bring your models to life. Check out the link in the comments!

el.cine

46,032 views • 1 year ago

Meet Gemini Omni, our new model that can create anything from any input, starting with video. With Gemini Omni, you can combine images, videos and text as inputs and generate high-quality videos grounded in Gemini's real-world knowledge.

Google Gemini

32,779,480 views • 24 days ago

Meet Gemini Omni, our new model that can create anything from any input, starting with video. With Gemini Omni, you can combine images, videos and text as inputs and generate high-quality videos grounded in Gemini's real-world knowledge. #GoogleIO

Google Gemini

88,320 views • 1 month ago

Llama 3.2 features 11B & 90B models, our first multimodal Llama models with support for vision tasks. These models can take in both image and text prompts to deeply understand and reason on inputs.

AI at Meta

121,529 views • 1 year ago

Today, every Nomic-Embed-Text embedding becomes multimodal. Introducing Nomic-Embed-Vision:
- a high-quality, unified embedding space for image, text, and multimodal tasks
- outperforms both OpenAI CLIP and text-embedding-3-small
- open weights and code to enable indie hacking, research, and experimentation
- released in collaboration with MongoDB, LlamaIndex 🦙, Hugging Face, Amazon Web Services, DigitalOcean, Lambda

CalCo

103,205 views • 2 years ago

Ovi is out on Hugging Face
Twin Backbone Cross-Modal Fusion for Audio-Video Generation
Ovi is a Veo 3-like video+audio generation model that simultaneously generates both video and audio content from text or text+image inputs.
- Video+Audio Generation: Generate synchronized video and audio content simultaneously
- Flexible Input: Supports text-only or text+image conditioning
- 5-second Videos: Generates 5-second videos at 24 FPS, area of 720×720, at various aspect ratios (9:16, 16:9, 1:1, etc.)

AK

23,082 views • 8 months ago

High quality AI generated human videos are coming! Animate Anyone can generate videos of anyone with a single image and a bit of pose guidance 🤯

Dreaming Tulpa 🥓👑

51,151,925 views • 2 years ago

Zhipu AI just released GLM-4.6V on Hugging Face. This new multimodal model achieves SOTA visual understanding, features native function calling for agents, and handles 128k context for documents. Perception to action!

DailyPapers

16,629 views • 6 months ago

Wan2.6: Commercial-Grade Image Output
Wan2.6-Image is now available.
🖼️ Interleaved Text-and-Image Output: Generate interleaved text-and-image content with logical reasoning capabilities, enabling layered, narrative-driven visual storytelling.
🖼️ Multi-Image Conditioned Generation: Support flexible referencing, combining, and replacement of multiple images, integrating varied visual inspirations to generate novel and compelling results.
🖼️ Commercial-Grade ID Preservation: High consistency in characters, styles, and elements for commercial scenarios.
🖼️ Extract creative elements, such as color, style, and composition, from reference images to enable aesthetically driven image generation.
🖼️ Precise Control of Camera Angles and Lighting: Support specifying camera perspective, spatial depth (foreground/background), and lighting details, enabling precise control over spatial composition and atmospheric mood.
Try Wan2.6 today.

Wan

33,253 views • 6 months ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model. 🎨
✨ New in 2.1:
🔹 Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image.
🔹 Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design.
🔹 Rich Styles and High Aesthetic: Capable of generating images in various styles, including photorealistic portraits, comics, and vinyl figures, it delivers outstanding visual appeal and artistic quality.
🔹 High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image.
HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters.
We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation.
Now, creators turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon.
🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

Wan2.6 Image Generation is ready for Commercial Use!
- Interleaved Text-and-Image Output
- Multi-Image Conditioned Generation
- Commercial-Grade ID Preservation
- Extract Creative Elements
- Precise Control of Camera Angles and Lighting
Try Wan2.6 today!

Alibaba Cloud

145,741 views • 5 months ago

High quality AI generated talking heads are coming! GAIA can generate talking avatars from a single portrait image and speech clip. It even supports text prompts like `sad`, `open mouth` or `surprise` to guide video generation. Crazy times ahead 🤯

Dreaming Tulpa 🥓👑

660,015 views • 2 years ago

🔥Excited to introduce CoDi-2! It follows complex multimodal-interleaved in-context instructions to generate any modality (text, vision, audio) in a zero/few-shot interactive way! Ziyi Yang, Yang Liu, Chenguang Zhu, Mohit Bansal 🧵👇

Zineng Tang

97,533 views • 2 years ago

I tested 4 AI tools with one specific prompt. The Results:
• Grok & Claude: Provided high-quality research and text. However, the output remained a document. No slide design was provided.
• Gemini: Organized the content into a structured outline, but the final layout lacked visual hierarchy.
• Converted the prompt directly into a formatted deck.
While others generate the content, builds the presentation. It handles the design, layouts, and visual flow automatically. If you need a finished file rather than a text draft, the choice is clear.

Vinay Bharambe

39,524 views • 3 months ago

Introducing ConceptAttention, an approach to interpreting diffusion transformer models! Write a prompt, choose some concepts, generate an image, and get high-quality heatmaps of text concepts. Our method outperforms existing methods like cross attention. Link to demo 👇

Alec Helbling

36,631 views • 1 year ago

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud.
One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first.
Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models.
Gemini's multimodal capabilities also enable new approaches in AI application development, for example:
- The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats.
- Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously.
- Function calling feature integrates real-time data (e.g., current exchange rates) into model responses.
The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies.
Sign up here:

Andrew Ng

73,915 views • 1 year ago

We’re excited to announce the release and open-sourcing of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. The effect is completely comparable to the industry’s flagship closed-source model. 🚀🚀🚀
HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities:
✅ Reason with world knowledge
✅ Understand complex, thousand-word prompts
✅ Generate precise text within images
Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks.
Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content.
The current release focuses solely on text-to-image generation, and future updates will include image-to-image, image editing, multi-turn interaction, and more.
👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,523 views • 8 months ago

Alibaba released LHM, a new model that can generate high-quality, animatable 3D human avatars from a single image in just a few seconds.

Dreaming Tulpa 🥓👑

21,853 views • 1 year ago