✨New preprint: Dual-Process Image Generation! We distill feedback from a VLM into feed-forward image generation, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator.🧵

Grace Luo

2,223 subscribers

133,297 views • 1 year ago • via X (Twitter)

Science & Technology, Education

Related videos

PaLM-E is the largest VLM reported to date. We observe emergent capabilities like multimodal chain of thought reasoning, and multi-image inference, despite being trained on only single-image prompts. Though not the focus of our work, PaLM-E sets a new SOTA on OK-VQA benchmark.

Danny Driess

12,291 views • 3 years ago

Vision-language AI models have a gaze. And you can steer it! 👀 Redirect just 9% of a model’s attention heads to any region in an image, and the VLM will start describing that region mid-generation. We call them Gaze Heads! Try the demo: 🧵👇

Rohit Gandikota

48,458 views • 14 days ago

🚀 Q2 Image Model is Live! ✨Text to image, Reference to image, and Image editing supported ✨Ultra-fast generation (as fast as 5s), 4K quality, super consistency ✨One-stop workflow: turn reference images into subjects and reuse them for video creation ✨Unlimited image generation for members until Dec 31 New users: use code VIDUQ2RTI for bonus credits 🎁 #ViduQ2RTI #ViduQ2 #Viduai #vidu

Vidu AI

24,747 views • 7 months ago

We've officially released and open-sourced HunyuanImage 2.1, our latest text-to-image model. The new model delivers on our commitment to balancing performance and quality. With native 2K image generation, HunyuanImage 2.1 is an advanced open-source text-to-image model.🎨 ✨ New in 2.1: 🔹Advanced Semantics: Supports ultra-long and complex prompts of up to 1000 tokens, and precisely controls the generation of multiple subjects in a single image. 🔹Precise Chinese and English Text Rendering with seamless image–text integration: The model naturally integrates text into images, making it suitable for a wide range of applications such as product covers, illustrations, and poster design to meet the needs of various fields. 🔹Rich Styles and High Aesthetic: Capable of generating images in various styles—including photorealistic portraits, comics, and vinyl figures—it delivers outstanding visual appeal and artistic quality. 🔹High-Quality Generation: Efficiently produces ultra-high-definition (2K) images in the same time other models take to generate a 1K image. HunyuanImage 2.1 uses two text encoders: a multimodal large language model (MLLM) to improve the model's image and text alignment capabilities, and a multi-language character-aware encoder to improve text rendering capabilities. The model is a single- and double-stream diffusion transformer with 17B parameters. We've also open-sourced the weights of the accelerated version with meanflow, which reduces inference steps from 100 to just 8, and PromptEnhancer, the first industrial-grade rewriting model that enhances your prompts for more nuanced and expressive image generation. Now, creators turn complex ideas—like posters with slogans or multi-panel comics—into visuals faster than ever. We’re just getting started. Stay tuned for our native multimodal image generation model coming soon. 🌐Website: 🔗Github: 🤗Hugging Face: ✨Hugging Face Demo:

Tencent Hy

89,257 views • 9 months ago

Why don’t VLAs generalize as well as their VLM counterparts? One culprit: catastrophic forgetting during fine-tuning. 🧠 We introduce VLM2VLA: a training paradigm that preserves the VLM capabilities while teaching robotic control. 🧵

Anirudha Majumdar

60,486 views • 9 months ago

Step into the future of AI image generation with Qwen-Image! From superior text rendering to consistent image editing across multiple languages, it sets a new benchmark! 💡What will you create with Qwen-Image?

Alibaba Group

203,475 views • 10 months ago

✨ IT'S HERE! ✨ Build the life you see and dream with 4o Image Generation. Show me your best images with #MakeItWithCopilot 🙌

Microsoft Copilot

19,197 views • 1 year ago

📢WorldAgents: 3D worlds only from 2D image models - without any training! We propose an agentic approach with a Director (VLM) to plan the scene, a Generator (Flux or NanoBanana) for new views, and a Verifier (VLM) for selection / 3D consistency. -> High-fidelity 3D worlds from a single text prompt. What's remarkable: our agents find consistent views from 2D image models to obtain 3D-consistent worlds; this shows that image models contain world priors - agents just need to find them! Great work by Ziya Erkoç Angela Dai

Matthias Niessner

18,923 views • 3 months ago

Create or transform images into a variety of styles with 4o image generation.

OpenAI

258,807 views • 1 year ago

FET/ASI Crypto Drops Bombshell Update: Text to Image Generation ASI-1 Mini🔥 ASI-1 Mini is the first model in the ASI:Train family. Today they introduce Text-to-Image Generation, enabling users to create high-quality images from textual descriptions. Fetch.ai Artificial Superintelligence Alliance

cryptoFOXXY

16,613 views • 1 year ago

🚀 Introducing Emu3.5 — a large-scale multimodal world model that natively predicts the next vision-language state. 🔥 Trained on over 10T interleaved vision-language tokens and enhanced with reinforcement learning, Emu3.5 achieves powerful multimodal reasoning and generation. ⚡ Powered by our new Discrete Diffusion Adaptation (DiDA) for 20× faster inference. 🔥 Emu3.5 outperforms Nano Banana across image generation, editing, interleaved tasks and more. 🌍 Explore Emu3.5: Github: #Emu3 #MultimodalAI #WorldModel #NextTokenPrediction

BAAI

51,880 views • 8 months ago

Today we launched Nano Banana Pro (Gemini 3 Pro Image). This state-of-the-art image generation and editing model turns your vision into functional reality with unprecedented control, improved text rendering and factual accuracy.

News from Google

215,740 views • 7 months ago

TBH, Native Image Generation Is Very Cool 😎 Google’s other release today is far more interesting than Gemma. With native image generation, the LLM understands images easily and can do much better edits. Will have it on ChatLLM tomorrow.

Bindu Reddy

12,387 views • 1 year ago

Turns out that vision-language models can control robots too. The secret is to just finetune them to print out the actions (literally, as text). Really excited about our new result, the successor to RT-1. RT-2 is a pre-trained VLM: Short 🧵👇

Sergey Levine

165,052 views • 2 years ago

📢 OneCanvas: 3D Scene Understanding via Panoramic Reprojection We extract features from video frames and reproject them into one occlusion-free view of the whole scene that a 2D VLM reads just like a normal image. We can center this view on any viewpoint, including an agent's own pose for situated reasoning. The same projection lets us create spatial training tasks with no human annotation, solvable only by reasoning over the 3D positions of real object features placed on an otherwise empty canvas. The result is a stock 2D VLM that reasons in 3D, setting a new state of the art across spatial benchmarks at far less compute. 🌐 ▶️ Great work by Bartłomiej Baranowski & Dave Zhenyu Chen

Matthias Niessner

24,613 views • 11 days ago

How did we get from Will Smith eating spaghetti to modern AI image + video generation? 🍝 🤖 We're explaining how products like the ChainGPT NFT Generator work to create amazing images from simple prompts. ➡️

ChainGPT

75,337 views • 5 months ago

Compute enabled our first image generation launch (and a +32% jump in WAU over the following weeks) as well as our latest image generation launch yesterday. We have a lot more coming… and need a lot more compute.

OpenAI

777,887 views • 6 months ago

GMI Studio is officially here. We’ve moved beyond multimodal hubs to create a dedicated workflow engine for high-end AI creation. - Long-form video (2 min+). - 1 image, 10 video styles, cinematic output. - 1.4x faster generation speed, powered by GMI inference. - Social feed: Discover and like. The beta is over. The studio is open.

GMI Cloud

971,213 views • 5 months ago

Image Generation has landed🎨🤖 Introducing Image Arena - a battle of image generation models like FLUX, Stable Diffusion, Dall-E, Recraft, Ideogram, and more! Who will reign supreme? 1. Describe your desired image🎨 2. Two anonymous models output images 3. Vote for the winner! Enjoy! We will be releasing the leaderboard soon! More examples below👇

lmarena.ai (formerly lmsys.org)

106,853 views • 1 year ago

AI is moving fast. And OpenAI's new image model proves it — knowing how to use AI images to your advantage is now crucial for designers. So we teamed up with Nick St. Pierre (The AI Image Master) to walk through one of the best AI image generation tools right now, Midjourney, and build a full website using only AI images, in 1 hour. Watch the full breakdown👇

Relume

14,706 views • 1 year ago