Hongyang Zhang

@hongyangzh • 2,793 subscribers

Co-founder of EAGLE AI @EagleCorp. Professor in CS @UWaterloo @VectorInst. CMU ML PhD @SCSatCMU. @PKU1898 Alumni.

Shorts

Introducing EAGLE, a new method for fast LLM decoding based on compression:
- 3x🚀 faster than vanilla decoding
- 2x🚀 faster than Lookahead (on its benchmark)
- 1.6x🚀 faster than Medusa (on its benchmark)
- provably maintains the text distribution
- trainable (in 1~2 days) and testable on RTX 3090s
Playground: Blog: Code:
⚒️First Principle: Compression! Yi Ma
We find that the sequence of second-top-layer features is compressible, so a small model can easily predict subsequent feature vectors from previous ones.
🙏Acknowledgments: This project is greatly inspired by the Medusa team (Tianle Cai ✈️ ICML🇰🇷 @yli3521 Zhengyang Geng ✈️ ICML Hongwu Peng Tri Dao), the Lookahead team (Hao Zhang LMSYS Org), and others.
Joint work with Yuhui Li and Chao Zhang.

118,844 views
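The EAGLE post says the method "provably maintains text distribution". The post does not spell out the mechanism, so the following is only a minimal sketch, assuming the standard accept/reject rule of speculative sampling: a drafted token is accepted with probability min(1, p/q), and otherwise a replacement is drawn from the normalized residual max(p - q, 0). The function names `speculative_accept` and `output_distribution` are hypothetical and are not taken from the EAGLE code.

```python
import random


def speculative_accept(p, q, draft_token, rng=random):
    """One accept/reject step of speculative sampling (illustrative only).

    p: target-model probabilities, q: draft-model probabilities, both
    lists over the same vocabulary. draft_token was sampled from q.
    Returns the token that is actually emitted.
    """
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft_token] / q[draft_token]):
        return draft_token
    # On rejection, resample from the residual max(p - q, 0).
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of the emitted token when the draft comes from q."""
    # Probability of drafting x and accepting it: q(x) * min(1, p(x)/q(x)).
    accept = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accept)
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p equals q, so nothing is ever rejected
        return accept
    return [a + reject_mass * r / total for a, r in zip(accept, residual)]


target = [0.5, 0.3, 0.2]  # made-up target-model probabilities
draft = [0.2, 0.2, 0.6]   # made-up draft-model probabilities
emitted = output_distribution(target, draft)
```

Under these assumptions `emitted` matches `target` up to floating-point error, whatever the draft distribution is. A poor draft model lowers the acceptance rate, and with it the speedup, but it does not change the text that is produced.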

Jointly announcing EAGLE-3 with SGLang: setting a new record in LLM inference acceleration!
- 5x🚀 faster than vanilla decoding (on HF)
- 1.4x🚀 faster than EAGLE-2 (on HF)
- A record of ~400 TPS on Llama 3.1 8B with a single H100 (on SGLang)
- 1.65x🚀 in latency even at a large batch size, bs=64 (on SGLang)
- A new scaling law: more training data, better speedup
- Apache 2.0
Paper: Code: SGLang version:
⚒️Takeaway: We introduce training-time test, a novel draft model training technique: we replace feature prediction with direct token prediction and shift from top-layer-only features to multi-layer feature fusion. This approach unlocks a new scaling law that was not observed in EAGLE and EAGLE-2.
🙏Acknowledgments: We would like to thank the SGLang team (zhyncs Lianmin Zheng Ying Sheng James Liu, Ke Bao, and others LMSYS Org) for merging and carefully evaluating EAGLE-3 on SGLang.
🤝Want to collaborate? We're a small academic group with limited GPU resources. If you're interested in supporting our next version of EAGLE or would like us to train a preliminary version tailored to a specific model, please get in touch!
Joint work with Yuhui Li, Fangyun Wei, and Chao Zhang.

42,200 views

Can a $300 RTX 3060 be faster than a $10,000 A100 for LLM decoding?🧐 Yes. Introducing EAGLE-2:
- 4.26x🚀 faster than vanilla decoding (13B)
- 1.4x🚀 faster than EAGLE (13B)
- No extra training beyond EAGLE
- Maintains the text distribution
Paper: Code: Demo:

35,689 views

Videos

Introducing The Matrix, a foundation world model for generating infinite-length, hyper-realistic videos with real-time, frame-level control:
- Infinite-length video generation
- 720p high-quality rendering
- Real-time, frame-level control at 16 FPS
- Generalization to real-world video control
🔗Blog: 📄Paper: 💻Code & Playable Demo: Coming soon!
Key Innovation: A brand new technique called the shift-window denoise process model, enabling auto-regressive generation for diffusion and consistency models in real-time.
Special thanks to project leader Ruili Feng and the entire Matrix team for their dedication and hard work over the year-long project.

178,322 views • 1 year ago
