Hongyang Zhang

@hongyangzh · 2,777 subscribers

Assistant Professor in CS @UWaterloo @VectorInst. CMU ML PhD @SCSatCMU. @PKU1898 Alumni. Proud developer of EAGLE & TRADES.

Shorts

Introducing EAGLE, a new method for fast LLM decoding based on compression:
- 3x🚀 than vanilla
- 2x🚀 than Lookahead (on its benchmark)
- 1.6x🚀 than Medusa (on its benchmark)
- provably maintains the text distribution
- trainable (in 1~2 days) and testable on RTX 3090s

Playground: Blog: Code:

⚒️ First Principle: Compression! Yi Ma
We find that the sequence of second-top-layer features is compressible, which makes it easy for a small model to predict subsequent feature vectors from previous ones.

🙏 Acknowledgement: This project is greatly inspired by the Medusa team (Tianle Cai, @yli3521, Zhengyang Geng, Hongwu Peng, Tri Dao), the Lookahead team (Hao Zhang, LMSYS Org), and others.

Joint work with Yuhui Li and Chao Zhang

118,810 views
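The claim that the text distribution is maintained can be illustrated with the standard speculative sampling acceptance rule. The sketch below is not the EAGLE implementation; it assumes a single drafted token, a draft distribution `q_draft`, and a target distribution `p_target` over the same vocabulary. The function names `verify_draft_token` and `sample_via_draft` are hypothetical and used only for illustration.

```python
import random


def verify_draft_token(p_target, q_draft, draft_token, rng):
    """Accept or replace one drafted token (standard speculative sampling).

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token: index sampled from q_draft, so q_draft[draft_token] > 0.
    Returns (token, accepted).
    """
    p = p_target[draft_token]
    q = q_draft[draft_token]
    # Accept the drafted token with probability min(1, p / q).
    if rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    tokens = list(range(len(residual)))
    return rng.choices(tokens, weights=residual, k=1)[0], False


def sample_via_draft(p_target, q_draft, rng):
    """Draw a draft token from q_draft, then verify it against p_target."""
    tokens = list(range(len(q_draft)))
    draft_token = rng.choices(tokens, weights=q_draft, k=1)[0]
    token, _ = verify_draft_token(p_target, q_draft, draft_token, rng)
    return token
```

Under these assumptions, tokens returned by `sample_via_draft` follow `p_target` regardless of how poor `q_draft` is; a better draft only raises the acceptance rate, which is where the speedup comes from.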

Jointly announcing EAGLE-3 with SGLang: Setting a new record in LLM inference acceleration!
- 5x🚀 than vanilla (on HF)
- 1.4x🚀 than EAGLE-2 (on HF)
- A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang)
- 1.65x🚀 in latency even for large bs=64 (on SGLang)
- A new scaling law: more training data, better speedup
- Apache 2.0

Paper: Code: SGLang version:

⚒️ Takeaway: Introducing training-time test, a novel draft model training technique: we replace feature prediction with direct token prediction and shift from top-layer-only features to multi-layer feature fusion. This approach unlocks a new scaling law previously undiscovered in EAGLE and EAGLE-2.

🙏 Acknowledgement: We would like to thank the SGLang team (zhyncs, Lianmin Zheng, Ying Sheng, James Liu, Ke Bao, and others; LMSYS Org) for their merge and careful evaluation of EAGLE-3 on SGLang.

🤝 Want to collaborate? We're a small academic group with limited GPU resources. If you're interested in supporting our next version of EAGLE or would like us to train a preliminary version tailored to a specific model, please get in touch!

Joint work with Yuhui Li, Fangyun Wei, and Chao Zhang

41,164 views

Can a $300 RTX 3060 be faster than a $10,000 A100 for LLM decoding? 🧐 Yes. Introducing EAGLE-2:
- 4.26x🚀 vanilla (13B)
- 1.4x🚀 EAGLE (13B)
- No extra training beyond EAGLE
- Maintains the text distribution

Paper: Code: Demo:

35,677 views
