Загрузка видео...

Не удалось загрузить видео

На главную

GLM-4.6V can accept multimodal inputs of various types and automatically generate high-quality, structured image-text interleaved content.

13,655 просмотров • 6 месяцев назад •via X (Twitter)

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the the accelerated version with meanflow which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 просмотров • 9 месяцев назад

Explore state-of-the-art multimodal prompting in our new short course Large Multimodal Model Prompting with Gemini, taught by Erwin Huizenga in collaboration with Google Cloud. One interesting insight from this course: with multimodal models, prompt structure matters significantly. Placing text inputs, such as a patient's medical history, before image inputs, like an X-ray, can enhance the model's ability to contextualize and interpret visual data effectively. In other contexts, such as image captioning, you may get better results by putting the image first. Multimodal models behave differently than text-only LLMs, and effective prompting for models varies depending on the model you’re using. In this course you’ll learn how to effectively prompt Gemini models. Gemini's multimodal capabilities also enable new approaches in AI application development, for example: - The Gemini library handles various video formats (MP4, MOV, MPEG), streamlining applications using these formats. - Large context window (up to 1 million tokens) enables processing of extensive content, like analyzing multiple 50-minute videos simultaneously. - Function calling feature integrates real-time data (e.g., current exchange rates) into model responses. The course demonstrates building multimodal applications with real-world examples including document analyzers that reason across text and graphs simultaneously, video content extractors that find and timestamp specific information from multiple hours of footage, and automated expense report systems processing receipt images while cross-referencing company policies. Sign up here:

Andrew Ng

73,915 просмотров • 1 год назад

We’re excited to announce the release and open-source of HunyuanImage 3.0 — the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference.The effect is completely comparable to the industry’s flagship closed-source model.🚀🚀🚀 HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities: ✅Reason with world knowledge ✅Understand complex, thousand-word prompts ✅Generate precise text within images Different from traditional DiT architecture image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple Diffusion and LLM training for a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks. Whether you're an illustrator, designer, or creator, this is built to slash your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content. The current release focuses solely on text-to-image generation and future updates will include image-to-image, image editing, multi-turn interaction, and more. 👉🏻Try it now: 🔗GitHub: 🤗Hugging Face:

Tencent Hy

412,523 просмотров • 8 месяцев назад