DeepSeek's latest (MLA, Multi-Token Prediction, 256 Experts, FP8 block quantization) shines with vLLM. Catch the office hours session where we discuss all the DeepSeek goodies and explore their integration and benchmarks with #vLLM.

14,093 views • 1 year ago • via X (Twitter)

Comments: 2

Neural Magic (Acquired by Red Hat) • 1 year ago

@vllm_project You can see the session slides here:

Lab4crypto • 1 year ago

🚨 New Weekly Quant Analysis! 🚨 Dive into my free, in-depth crypto market analysis, which is delivered straight to your inbox! 📊📬 🔍 Quantitative insights, growth trends, and risk management. 👀 Check out the preview below 👇 and subscribe now for early access!

Related videos

I trained a 100-million-parameter DeepSeek V3 LLM from scratch. Here's what you need to know.

Previously I trained a traditional GPT-2 architecture, which has become obsolete with recent LLM advancements. Most recent models like Llama, Mistral, DeepSeek, and GPT-4 use newer architectures.

✦ Model configuration of my DeepSeek V3 SLM
- Parameters: 109,032,032
- Embedding dimension: 512
- Layers: 8
- Heads: 8
- Experts (MoE): 8
- Experts per token: 2

✦ DeepSeek brings major architectural changes:
- Multi-Head Latent Attention
- Mixture of Experts
- RMS Norm
- Multi-Token Prediction

✦ Dataset challenge
- TinyStories is great for learning about SLMs. I trained GPT-2 on it previously with good results.
- But I needed a more challenging dataset.
- If I used TinyStories again for DeepSeek, how would I know whether MHLA, MoE, or MTP works better than the old architecture?
- The old architecture can already handle it, so the new DeepSeek model would too, without really exercising the latest advancements.

That's why I moved to the FineWeb-Edu dataset. Thanks to Yuvraj Singh (smolhub.com) for suggesting this dataset.

✦ Training journey
- Rented an A100 PCIe GPU and trained the model.
- Did test runs. During the final run, the model was 65% trained when it stopped after 4 hours due to a glitch.
- Fixed all edge cases and ran training again with increased config parameters.
- Final training: 7 hours, 20,000 epochs

𝐓𝐨𝐭𝐚𝐥 𝐆𝐏𝐔 𝐜𝐨𝐬𝐭: $17
- $9.53 for the main 7-hour run
- $7.42 for experiments and demos

✦ Reflection
An amazing long project that taught me the latest architectural advancements. I'll reimplement and revisit it after a few weeks because there's a lot of complexity, mostly in the Multi-Head Latent Attention part. I need to strengthen my grasp of the concepts.

Code
Final trained model
Dataset

Resources
Huge shoutout to Raj Dandekar again for creating one of the most detailed video series about DeepSeek. This was my primary resource for the implementation.
Playlist

Blogs by Maarten Grootendorst
These are excellent visual blogs for understanding MoE in detail. Thanks, Maarten, for your amazing contributions to the community through your books and blogs.
Blogs on MoE

Implementation of MoE from scratch by @aviTwit3
One of the most detailed blogs on implementing Mixture of Experts. Thanks, Avinash, for this blog. It helped me understand Mixture of Experts much better.

If you're in the 𝐌𝐋 & 𝐋𝐋𝐌 space, I would love to 𝐜𝐨𝐧𝐧𝐞𝐜𝐭 and discuss this field in general, so give me a follow for that.
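To make the "Experts (MoE): 8, Experts per token: 2" setting from the post concrete, here is a minimal sketch of top-2 expert routing in plain Python. It is an illustration only: the names `SLMConfig` and `top_k_route` and the example router scores are hypothetical and not taken from the author's code, and the softmax-over-selected-logits weighting is one common gating variant, not necessarily the one the author or DeepSeek uses.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SLMConfig:
    # Values quoted from the post; the field names are hypothetical.
    embedding_dim: int = 512
    num_layers: int = 8
    num_heads: int = 8
    num_experts: int = 8
    experts_per_token: int = 2


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    This gating choice is an assumption made for illustration.
    """
    if not 0 < k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits, highest first.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    # Subtract the largest selected logit for numerical stability.
    peak = router_logits[top[0]]
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


cfg = SLMConfig()
# Hypothetical router scores for a single token, one score per expert.
logits = [0.1, 2.0, -1.0, 0.5, 1.0, 0.0, -0.3, 1.2]
routes = top_k_route(logits, cfg.experts_per_token)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

In a real MoE layer, the token's hidden state would be sent only to the selected experts and their outputs combined using these weights, which is why only a fraction of the parameters is active per token.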

Mayank Pratap Singh

48,005 views • 11 months ago