Today, we introduce HunyuanImage 3.0-Instruct, a native multimodal model focused on image editing that integrates visual understanding with precise image synthesis! 🚀 It understands input images and reasons before generating images. Built on an 80B-parameter MoE architecture (13B activated), it natively unifies deep multimodal comprehension and high-fidelity generation.

🧠 A "Thinking" Model with Native CoT & MixGRPO: The model doesn’t just execute commands; it processes them through a native Chain-of-Thought (CoT) schema. Enhanced by our self-developed MixGRPO algorithm, it reasons through complex instructions to achieve flawless intent alignment and human-preference consistency.

🎨 Precise Editing & Multi-Image Fusion: The model enables accurate image editing by adding, removing, or modifying elements while keeping non-target areas perfectly intact. It also excels at seamless multi-image fusion, synthesizing complex scenes by extracting and blending elements from multiple sources into a unified, consistent output.

🏆 SOTA Performance: HunyuanImage 3.0-Instruct sets a new benchmark in visual quality and alignment, delivering performance that matches leading proprietary models.

We aim to enable the community to explore new ideas with a state-of-the-art foundation model, fostering a dynamic and vibrant image generation ecosystem. 🛠️🎨

💻 Try it at (PC only):

Tencent Hy

39,674 subscribers

125,803 views • 4 months ago • via X (Twitter)

Arts Education Science & Technology

Related Videos

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model. 🎨

✨ New in 2.1:

🔹 Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image.

🔹 Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design.

🔹 Rich Styles and High Aesthetic: Capable of generating images in various styles (including photorealistic portraits, comics, and vinyl figures), it delivers outstanding visual appeal and artistic quality.

🔹 High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image.

HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters.

We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation.

Now, creators can turn complex ideas, like posters with slogans or multi-panel comics, into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon.

🌐 Website: 🔗 GitHub: 🤗 Hugging Face: ✨ Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

We’re excited to announce the release and open-sourcing of HunyuanImage 3.0, the largest and most powerful open-source text-to-image model to date, with over 80 billion total parameters, of which 13 billion are activated per token during inference. Its results are comparable to the industry’s flagship closed-source models. 🚀🚀🚀

HunyuanImage 3.0 originates from our internally developed native multimodal large language model, with fine-tuning and post-training focused on text-to-image generation. This unique foundation gives the model a powerful set of capabilities:

✅ Reason with world knowledge
✅ Understand complex, thousand-word prompts
✅ Generate precise text within images

Unlike traditional DiT-based image generation models, HunyuanImage 3.0’s MoE architecture uses a Transfusion-based approach to deeply couple diffusion and LLM training in a single, powerful system. Built on Hunyuan-A13B, HunyuanImage 3.0 was trained on a massive dataset: 5 billion image-text pairs, video frames, interleaved image-text data, and 6 trillion tokens of text corpora. This hybrid training across multimodal generation, understanding, and LLM capabilities allows the model to seamlessly integrate multiple tasks.

Whether you're an illustrator, designer, or creator, this is built to cut your workflow from hours to minutes. HunyuanImage 3.0 can generate intricate text, detailed comics, expressive emojis, and lively, engaging illustrations for educational content.

The current release focuses solely on text-to-image generation; future updates will include image-to-image, image editing, multi-turn interaction, and more.

👉🏻 Try it now: 🔗 GitHub: 🤗 Hugging Face:

Tencent Hy

412,572 views • 8 months ago

Introducing Wan2.6 - A native multimodal model that turns your ideas into breathtaking videos and images! · Starring: Cast characters from reference videos into new scenes. Support human or human-like figures, enabling complex multi-person and human-object interactions with appearance and voice consistency. · Intelligent Multi-shot Narrative: Turn simple prompts into auto-storyboarded, multi-shot videos. Maintain visual consistency and upgrade storytelling from single shots to rich narratives. · Native A/V Sync: Generate multi-speaker dialogue with natural lip-sync and studio-quality audio. It doesn’t just look real - it sounds real. · Cinematic Quality: 15s 1080p HD generation with comprehensive upgrades to instruction adherence, motion physics, and aesthetic control. · Advanced Image Synthesis and Editing: Deliver cinematic photorealism with precise control over lens and lighting. Support multi-image referencing for commercial-grade consistency and faithful aesthetic transfer. · Storytelling with Structure: Generate interleaved texts and images powered by real-world knowledge and reasoning capabilities, enabling hierarchical and structured visual narratives.

Wan

3,847,034 views • 6 months ago

Introducing ChatGPT Images 2.0. A state-of-the-art image model that can take on complex visual tasks and produce precise, immediately usable visuals, with sharper editing, richer layouts, and thinking-level intelligence. Video made with ChatGPT Images.

OpenAI

12,872,759 views • 2 months ago

Introducing Nano Banana Pro (Gemini 3 Pro Image), our new state-of-the-art image generation and editing model from Google DeepMind. It improves on the original model while adding new advanced capabilities, enhanced world knowledge and text rendering, allowing you to create and edit studio-quality, production-ready visuals.

Google

1,896,824 views • 7 months ago

Step into the future of AI image generation with Qwen-Image! From superior text rendering to consistent image editing across multiple languages, it sets a new benchmark! 💡What will you create with Qwen-Image?

Alibaba Group

203,475 views • 10 months ago

CosmicMan: A Text-to-Image Foundation Model for Humans. We present CosmicMan, a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and

AK

46,778 views • 2 years ago

Seedream 4 arrived to Krea. this new model is competitive with Nano Banana for image editing and supports native 4k resolution. try it now in Krea Image!

KREA AI

52,130 views • 9 months ago

Meet Qwen-Image-Edit, your smart, seamless image editing companion! Built on the powerful 20B Qwen-Image model, it handles both visual and semantic edits, and delivers state-of-the-art performance. Whether you're refining visuals or transforming styles, it’s got you covered! ✨

Alibaba Group

64,349 views • 10 months ago

TurboEdit Instant text-based image editing discuss: We address the challenges of precise image inversion and disentangled image editing in the context of few-step diffusion models. We introduce an encoder-based iterative inversion technique. The inversion network is conditioned on the input image and the reconstructed image from the previous step, allowing for correction of the next reconstruction towards the input image. We demonstrate that disentangled controls can be easily achieved in the few-step diffusion model by conditioning on an (automatically generated) detailed text prompt. To manipulate the inverted image, we freeze the noise maps and modify one attribute in the text prompt (either manually or via instruction-based editing driven by an LLM), resulting in the generation of a new image similar to the input image with only one attribute changed. The method can further control the editing strength and accept instructive text prompts. Our approach facilitates realistic text-guided image edits in real time, requiring only 8 function evaluations (NFEs) in inversion (a one-time cost) and 4 NFEs per edit. Our method is not only fast, but also significantly outperforms state-of-the-art multi-step diffusion editing techniques.

AK

16,062 views • 1 year ago

1/4 🚀 We are launching Qwen-Image-2.0, a next-generation foundational image generation model. The key highlights of Qwen-Image-2.0 include:

Professional Typography Rendering: Supports 1k-token instructions for direct generation of professional infographics, including PPTs, posters, comics, and more.

Stronger Semantic Adherence: Native 2K resolution support for finely detailed realistic scenes, including people, nature, and architecture.

Improved Text Rendering: Integrated understanding and generation capabilities, unifying image generation and editing in a single mode.

Lighter Model Architecture: Smaller model size with faster inference speed.

Tongyi Lab

164,097 views • 4 months ago

Compute is the backbone of the AI-driven future. OptimAI Compute Engine would enable scalable, high-performance workloads across image and video models, supporting: • Text-to-image generation • Image editing and inpainting • 2K / 4K / 8K super-resolution • Brand-aligned visual synthesis • OCR and image intelligence • Video generation and motion synthesis • Frame interpolation and enhancement • Multimodal model inference Distributed compute built for faster inference, shorter training cycles, and production-grade AI execution.

OptimAI Network

18,767 views • 4 months ago

Introducing ChatGPT Images, powered by our flagship new image generation model. - Stronger instruction following - Precise editing - Detail preservation - 4x faster than before Rolling out today in ChatGPT for all users, and in the API as GPT Image 1.5.

OpenAI

3,232,175 views • 6 months ago

🚀UniWorld: a unified model that skips VAEs and uses semantic features from SigLIP! Using just 1% of BAGEL’s data, it outperforms on image editing and excels in understanding & generation. 🌟Now data, model, training & evaluation script are open-source!

Bin Lin

22,327 views • 1 year ago

Learn about Google’s new SOTA image model, Gemini 2.5 Flash, its key capabilities, and what’s next on the roadmap with some of the team behind the model: Nicole Brichtova, Kaushik Shivakumar, Mostafa Dehghani, and Robert Riachi, with Logan Kilpatrick.

Timecodes:
0:37 New model introduction
01:21 Demo: Image editing
03:44 Text rendering capabilities
04:44 Beyond human preference evals
06:44 Text rendering as a proxy for quality
08:38 Positive transfer between modalities
11:25 Demo: multi-turn, context aware image generation
13:54 Pixel-perfect editing and character consistency
15:51 Interleaved image generation
17:59 Specialized vs. native models
19:52 Understanding nuanced prompts
20:59 User feedback shaping model development
22:37 Improvements in character consistency
24:17 More natural looking images from team collaboration
26:41 What’s next for image generation models

Google AI Developers

31,150 views • 10 months ago

New: Gemini 2.0 Flash native image generation! This bot supports image output and conversational editing, allowing you to create and refine images by describing what you want. (1/3)

Poe

18,992 views • 1 year ago

Dreamina Seedance 2.0 is by far the best video model I've tried, and Dreamina makes it even better. You can try it in the link below. It's not just about the quality of the output, but about how much control you have over your video's look. Watch my video here. I'm attaching reference images and asking the model to generate a video using them. I can reference each image using the @ symbol to instruct the model which image to use. You can even upload a clip and use its camera movement, styles from an image, and audio vibe from a track. By the way, you can take an existing video and replace, remove, or add elements to it while the model preserves everything else. This is the closest we've gotten to "editing videos like photos".

Santiago

45,143 views • 2 months ago

📢📢 PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing 📢📢

PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly used methods such as LPIPS.

Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output.

In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through (potentially hand-drawn) segmentation maps, and condition style via an image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline.

🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,827 views • 7 months ago

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago

Introducing SDXL Turbo: A real-time text-to-image generation model. SDXL Turbo achieves state-of-the-art performance with a new distillation technology, enabling single-step image generation with unprecedented quality, reducing the required step count from 50 to just one. The code, research paper, and weights for non-commercial use are now available on our website. You can test SDXL Turbo on Stability AI’s image editing platform Clipdrop, with a beta demonstration of the real-time text-to-image generation capabilities. Learn more:

Stability AI

976,312 views • 2 years ago