
Zhijian Liu
@zhijianliu_ • 6,265 subscribers
Assistant Professor @UCSanDiego (running https://t.co/dxXyoNa7HQ). Prev: RS @NVIDIA. PhD @MIT. Efficient AI. Views are my own.

Reasoning VLAs can think. They just can't think fast. Until now. Introducing FlashDrive ⚡
🚀 716 ms → 159 ms on RTX PRO 6000 (up to 5.7×)
✅ Zero accuracy loss
FlashDrive = streaming inference + DFlash speculative reasoning + ParoQuant W4A8
Real-time reasoning for autonomous driving is here!
Zhijian Liu • 164,213 views • 1 month ago

Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪
Zhijian Liu • 170,758 views • 3 months ago
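
The "~4% relative loss" figure in the post above can be checked with a few lines of arithmetic. This is a minimal sketch: the two scores come from the post, while the function name `relative_loss` and the variable names are hypothetical and chosen only for illustration. It assumes 71.0 is the score before AWQ quantization.

```python
def relative_loss(baseline: float, quantized: float) -> float:
    """Fractional drop of `quantized` relative to `baseline`."""
    return (baseline - quantized) / baseline


# MMLU-Pro scores for Qwen3-4B as quoted in the post.
# Assumption: 71.0 is the score before AWQ, 68.2 the score after.
score_before_awq = 71.0
score_after_awq = 68.2

loss = relative_loss(score_before_awq, score_after_awq)
print(f"absolute drop: {score_before_awq - score_after_awq:.1f} points")
print(f"relative loss: {loss:.1%}")
```

The relative loss comes out just under 4%, consistent with the post's rounded figure.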

ParoQuant just got a big upgrade 🚀
✅ Supports the new Qwen3.5 models
⚡ Now runs on MLX (fast local inference on Apple Silicon)
🧠 Preserves reasoning quality with 4-bit quantization
We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨
Zhijian Liu • 49,004 views • 3 months ago

⚡ Speed of flash. Just 2 days after launch, DFlash is already running in SGLang. With serving-engine support, we can now unlock speedups at higher concurrency, and we've quickly put together a new demo based on it. We'll be cooking up more and better draft models over the next few weeks. 🔥 Stay tuned!
Zhijian Liu • 21,233 views • 4 months ago