🚀 The 4-bit era has arrived! Meet #SVDQuant, our new W4A4 quantization paradigm for diffusion models. Now, 12B FLUX can run on a 16GB 4090 laptop without offloading—with 3x speedups over W4A16 models (like NF4) while maintaining top-tier image quality. #AI #Quantization. 1/7

50,162 views • 1 year ago • via X (Twitter)

10 comments

Muyang Li • 1 year ago

Quantization effectively accelerates LLM inference, primarily by cutting weight-loading latency. But for compute-heavy diffusion models, weight quantization alone doesn’t boost speed. For real speedup, we need to quantize both weights and activations to the same bit width. 2/7

Muyang Li • 1 year ago

However, W4A4 quantization is tough with massive outliers. #SVDQuant addresses this by smoothing the activations and merging their outliers into the weights. It then applies SVD to the weights to add a 16-bit low-rank component, which absorbs the quantization difficulty. 3/7
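
The three steps in this post (smooth, split off a low-rank branch with SVD, quantize the rest) can be sketched in NumPy. This is a rough illustration and not the authors' implementation: the SmoothQuant-style scaling formula, the `alpha` and `rank` values, the per-token and per-channel scaling, and the function names are all assumptions, and the "16-bit" branch is simply kept in floating point here.

```python
import numpy as np

def fake_quant_int4(m, axis):
    """Round to a symmetric 4-bit grid (integers in [-7, 7]), then dequantize."""
    scale = np.abs(m).max(axis=axis, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)  # avoid division by zero
    return np.clip(np.round(m / scale), -7, 7) * scale

def svdquant_linear_sketch(x, w, rank=8, alpha=0.5, eps=1e-8):
    """Approximate x @ w. x: (tokens, in_features), w: (in_features, out_features)."""
    # 1. Smoothing (assumed SmoothQuant-style): divide each input channel of x
    #    by s and multiply the matching row of w by s, so x @ w is unchanged
    #    while activation outliers migrate into the weights.
    act_max = np.maximum(np.abs(x).max(axis=0), eps)
    w_max = np.maximum(np.abs(w).max(axis=1), eps)
    s = act_max ** alpha / w_max ** (1.0 - alpha)
    x_s = x / s
    w_s = w * s[:, None]

    # 2. SVD: keep the top-`rank` components as a high-precision low-rank branch.
    u, sig, vt = np.linalg.svd(w_s, full_matrices=False)
    l1 = u[:, :rank] * sig[:rank]   # (in_features, rank)
    l2 = vt[:rank, :]               # (rank, out_features)
    residual = w_s - l1 @ l2        # what is left for the 4-bit branch

    # 3. Low-rank branch in full precision, residual branch in simulated W4A4.
    low_rank_out = (x_s @ l1) @ l2
    quant_out = fake_quant_int4(x_s, axis=1) @ fake_quant_int4(residual, axis=0)
    return low_rank_out + quant_out
```

When `rank` equals the full rank of the weight matrix, the residual vanishes and the sketch reproduces `x @ w`; smaller ranks trade accuracy for a cheaper high-precision branch.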

Muyang Li • 1 year ago

Running the low-rank branch separately incurs high latency from redundant memory access. Our co-designed #Nunchaku engine uses kernel fusion to share inputs and outputs between branches, cutting memory access and halving kernel calls, which reduces the overhead to a negligible 5–10%. 4/7

Muyang Li • 1 year ago

On 12B FLUX.1-dev, we cut memory use by 3.6× compared to BF16 and, on a 16GB 4090 GPU, speed up inference by 8.7× over 16-bit and 3× over the NF4 W4A16 baseline without loss of image quality. On PixArt-Σ, SVDQuant also outperforms other W4A4 and even W4A8 models in visual quality. 5/7

Muyang Li • 1 year ago

Nunchaku removes redundant memory access, allowing SVDQuant to work seamlessly with off-the-shelf LoRA by running it in a separate branch, without re-quantization. Our INT4 FLUX.1-dev model adapts to 5 distinct styles, matching the image quality of the original 16-bit version. 6/7
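
To see why a separate branch avoids re-quantization, note that a LoRA update is itself a low-rank product, so it can sit next to an existing low-rank branch without touching the quantized weights. The NumPy sketch below shows one possible way to do this by concatenating factors; whether Nunchaku does it exactly like this is an assumption, and `add_lora_branch`, `lora_a`, `lora_b`, and `lora_scale` are hypothetical names.

```python
import numpy as np

def add_lora_branch(l1, l2, lora_a, lora_b, lora_scale=1.0):
    """Fold a LoRA update into an existing low-rank branch by concatenation.

    l1: (in_features, r), l2: (r, out_features)        existing low-rank branch
    lora_a: (in_features, k), lora_b: (k, out_features) LoRA factors
    Returns factors of rank r + k such that
    l1_cat @ l2_cat == l1 @ l2 + lora_scale * lora_a @ lora_b.
    The quantized residual weights are left untouched.
    """
    l1_cat = np.concatenate([l1, lora_scale * lora_a], axis=1)
    l2_cat = np.concatenate([l2, lora_b], axis=0)
    return l1_cat, l2_cat
```

The alternative, adding `lora_a @ lora_b` to the dense weight matrix, would change the values that were quantized and so would require quantizing the model again for every LoRA.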

Muyang Li • 1 year ago

Paper: Code: Demo: Website: Blog: Collaborate w/@syn7xavier @ZhekaiZhang @tianle_cai @xiuyu_l @jerry_gjx @xieenze_jr @chenlin_meng @junyanz89 @songhan_mit

Ramon Guthrie • 1 year ago

The big question: does this support @ComfyUI and Forge, and does it work with LoRAs and ControlNets?

Spacer • 1 year ago

Holy moly, 3x is massive! Would video models like Mochi be compatible with SVDQuant?

Muyang Li • 1 year ago

I think so since our method is general purpose. Will work on it.

Danny Ki • 1 year ago

Really 4-bit? What if numbers have feelings too?
