🚀 The 4-bit era has arrived! Meet #SVDQuant, our new W4A4 quantization paradigm for diffusion models. Now, 12B FLUX can run on a 16GB 4090 laptop without offloading—with 3x speedups over W4A16 models (like NF4) while maintaining top-tier image quality. #AI #Quantization. 1/7

Muyang Li

2,019 subscribers

50,162 views • 1 year ago • via X (Twitter)

Science & Technology #SVDQuant #AI #Quantization


10 comments

Muyang Li • 1 year ago

Quantization effectively accelerates LLM inference, primarily by cutting weight-loading latency. But for compute-heavy diffusion models, weight quantization alone doesn’t boost speed. For real speedup, we need to quantize both weights and activations to the same bit width. 2/7
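
A rough way to see this point is a roofline-style estimate, where a layer's time is the larger of its compute time and its memory time. The sketch below is illustrative only: the layer size, peak throughput, and bandwidth are hypothetical numbers, and the assumption that arithmetic throughput scales inversely with operand width is a simplification, not a measurement.

```python
def layer_time(tokens, weight_bits, act_bits, d_in=4096, d_out=4096,
               peak_ops_16bit=100e12, bandwidth=500e9):
    """Estimated seconds for one linear layer: max(compute time, memory time).

    All hardware numbers are hypothetical placeholders.
    """
    ops = 2 * tokens * d_in * d_out
    # Weight-only schemes such as W4A16 dequantize and multiply at the wider width.
    compute_bits = max(weight_bits, act_bits)
    # Assumption: arithmetic throughput scales inversely with operand width.
    peak = peak_ops_16bit * 16 / compute_bits
    weight_bytes = d_in * d_out * weight_bits / 8
    act_bytes = tokens * (d_in + d_out) * act_bits / 8
    return max(ops / peak, (weight_bytes + act_bytes) / bandwidth)


def speedup(tokens, weight_bits, act_bits):
    """Speedup of a quantized layer relative to the 16-bit baseline."""
    return layer_time(tokens, 16, 16) / layer_time(tokens, weight_bits, act_bits)


# tokens=1 mimics LLM decoding (memory-bound); tokens=4096 mimics a diffusion step.
for tokens in (1, 4096):
    print(tokens, round(speedup(tokens, 4, 16), 2), round(speedup(tokens, 4, 4), 2))
```

Under these assumptions, weight-only quantization helps the single-token case because weight loading dominates, while the many-token case only speeds up once activations are quantized as well.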

Muyang Li • 1 year ago

However, W4A4 quantization is tough with massive outliers. #SVDQuant addresses this by smoothing the activations and merging their outliers into the weights. It then applies SVD to the weights to add a 16-bit low-rank component, which absorbs the quantization difficulty. 3/7
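
The three steps described here (smooth, decompose, quantize the residual) can be sketched in NumPy. This is a toy illustration, not the authors' implementation: the smoothing rule (divide each input channel by its maximum magnitude), the rank of 8, the symmetric per-row/per-column quantizer, and all function names are assumptions made for the example, and the low-rank branch is kept in float64 rather than 16-bit.

```python
import numpy as np


def fake_quant_4bit(x, axis):
    """Symmetric 4-bit quantize-dequantize; the scale is the max reduced over `axis`."""
    scale = np.abs(x).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    return np.clip(np.round(x / scale), -8, 7) * scale


def naive_w4a4(X, W):
    """Baseline: quantize activations per token and weights per output channel."""
    return fake_quant_4bit(X, axis=1) @ fake_quant_4bit(W, axis=0)


def svdquant_like(X, W, rank=8):
    """Smooth, split off a high-precision low-rank branch, quantize the residual."""
    # 1) Smoothing (assumed rule): move per-channel activation scale into the weights.
    s = np.abs(X).max(axis=0)
    s = np.where(s == 0, 1.0, s)
    X_s = X / s
    W_s = W * s[:, None]          # X_s @ W_s equals X @ W exactly
    # 2) SVD of the smoothed weights; the top `rank` components stay high precision.
    U, S, Vt = np.linalg.svd(W_s, full_matrices=False)
    L1 = U[:, :rank] * S[:rank]
    L2 = Vt[:rank]
    R = W_s - L1 @ L2             # residual that still has to be quantized
    # 3) Low-rank branch at full precision, residual branch at 4 bits.
    low_rank = (X_s @ L1) @ L2
    residual = fake_quant_4bit(X_s, axis=1) @ fake_quant_4bit(R, axis=0)
    return low_rank + residual


def rel_error(approx, exact):
    return np.linalg.norm(approx - exact) / np.linalg.norm(exact)


rng = np.random.default_rng(0)
X = rng.standard_normal((64, 64))
X[:, :4] *= 50.0                  # four synthetic outlier channels
W = rng.standard_normal((64, 64))
Y = X @ W

err_naive = rel_error(naive_w4a4(X, W), Y)
err_svd = rel_error(svdquant_like(X, W), Y)
print(f"naive W4A4 error: {err_naive:.4f}, smoothed + low-rank error: {err_svd:.4f}")
```

On this synthetic input the outlier channels end up as a few large rows in the smoothed weights, which the low-rank branch captures, so the 4-bit residual has a much narrower range to cover.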

Muyang Li • 1 year ago

Running the low-rank branch separately incurs high latency from redundant memory access. Our co-designed #Nunchaku engine uses kernel fusion to share inputs and outputs between branches, cutting memory access and halving kernel calls, reducing overhead to a negligible 5–10%. 4/7

Muyang Li • 1 year ago

On 12B FLUX.1-dev, we cut memory use by 3.6× compared to BF16 and, on a 16GB 4090 GPU, achieve an 8.7× speedup over 16-bit and 3× over the NF4 W4A16 baseline without loss of image quality. On PixArt-Σ, it also outperforms other W4A4 and even W4A8 models in visual quality. 5/7

Muyang Li • 1 year ago

Nunchaku removes redundant memory access, allowing SVDQuant to work seamlessly with off-the-shelf LoRA by running it in a separate branch, without re-quantization. Our INT4 FLUX.1-dev model adapts to 5 distinct styles, matching the image quality of the original 16-bit version. 6/7
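
One way to picture why a LoRA needs no re-quantization: it is just another low-rank term added next to the quantized residual, so the quantized weights are never touched. The NumPy sketch below is an assumption-laden illustration, not Nunchaku's code. All shapes and names are hypothetical, and `R_q` is a random stand-in for an already-quantized residual.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r_svd, r_lora = 32, 24, 4, 2

x = rng.standard_normal((5, d_in))
R_q = rng.standard_normal((d_in, d_out))    # stand-in for the quantized residual weights
L1 = rng.standard_normal((d_in, r_svd))     # low-rank branch from the SVD step
L2 = rng.standard_normal((r_svd, d_out))
A = rng.standard_normal((d_in, r_lora))     # off-the-shelf LoRA down-projection
B = rng.standard_normal((r_lora, d_out))    # off-the-shelf LoRA up-projection

# LoRA as its own branch: R_q is reused unchanged.
y_separate = x @ R_q + (x @ L1) @ L2 + (x @ A) @ B

# Equivalent view: stack both low-rank branches into one wider branch.
down = np.concatenate([L1, A], axis=1)      # shape (d_in, r_svd + r_lora)
up = np.concatenate([L2, B], axis=0)        # shape (r_svd + r_lora, d_out)
y_merged = x @ R_q + (x @ down) @ up

print(np.allclose(y_separate, y_merged))
```

Folding the LoRA into the base weights instead would change the matrix being quantized and force a fresh quantization pass for every adapter.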

Muyang Li • 1 year ago

Paper: Code: Demo: Website: Blog: Collaborate w/@syn7xavier @ZhekaiZhang @tianle_cai @xiuyu_l @jerry_gjx @xieenze_jr @chenlin_meng @junyanz89 @songhan_mit

Ramon Guthrie • 1 year ago

The big question: does this support @ComfyUI and Forge, and does it work with LoRAs and ControlNets?

Spacer • 1 year ago

Holy moly, 3x is massive! Would video models like Mochi be compatible with SVDQuant?

Muyang Li • 1 year ago

I think so since our method is general purpose. Will work on it.

Danny Ki • 1 year ago

Really 4-bit? What if numbers have feelings too?
