Zhijian Liu

@zhijianliu_ • 6,695 subscribers

Assistant Professor @UCSanDiego (running https://t.co/dxXyoNa7HQ). Prev: RS @NVIDIA. PhD @MIT. Efficient AI. Views are my own.

Shorts

🔥 DFlash x MLX is happening! Shoutout to Arya Manjaramkar for the early work on this. We're building on the momentum. Native MLX support, more models (Qwen3.5), up to 4x faster. Lossless! 👉

214,798 views

Videos

Reasoning VLAs can think. They just can't think fast. Until now. Introducing FlashDrive⚡ 🚀 716 ms → 159 ms on RTX PRO 6000 (up to 5.7×) ✅ Zero accuracy loss FlashDrive = streaming inference + DFlash speculative reasoning + ParoQuant W4A8 Real-time reasoning for autonomous driving is here!

190,639 views • 3 months ago

Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬 ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. Recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪

171,119 views • 5 months ago

ParoQuant just got a big upgrade 🚀 ✅ Supports the new Qwen3.5 models ⚡ Now runs on MLX (fast local inference on Apple Silicon) 🧠 Preserves reasoning quality with 4-bit quantization We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨

49,358 views • 4 months ago

⚡ Speed of flash. Just 2 days after launch, DFlash is already running in SGLang. With serving-engine support, we can now unlock speedups at higher concurrency, and we've quickly put together a new demo based on it. We'll be cooking up more and better draft models over the next few weeks.🔥 Stay tuned!

21,233 views • 6 months ago
