Mayank Pratap Singh's banner

Mayank Pratap Singh

@Mayank_022 • 1,815 subscribers

Architecting, post-training, and fine-tuning multimodal LLMs and audio deep learning models. gen ai • llms • ml • mlops

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

I tested Hugging Face ml-intern, given the prompt "Fine-tune a Segment Anything Model (SAM) on a useful medical dataset. Train the model, and provide a comprehensive tutorial in a Jupyter Notebook file. Additionally, create a Hugging Face article/blog post documenting everything you have done." It did it all autonomously: - Researched via hf_papers & searched GitHub/HF Hub - Found an HF dataset & wrote the finetuning script - Trained it using HF compute (took ~1 hour) - Pushed the weights & wrote the article Here are the model weights, code, and the blog it generated: hf article model weights Awesome stuff Aksel , looking forward to use this. 🔥

I tested Hugging Face ml-intern, given the prompt "Fine-tune a Segment Anything Model (SAM) on a useful medical dataset. Train the model, and provide a comprehensive tutorial in a Jupyter Notebook file. Additionally, create a Hugging Face article/blog post documenting everything you have done." It did it all autonomously: - Researched via hf_papers & searched GitHub/HF Hub - Found an HF dataset & wrote the finetuning script - Trained it using HF compute (took ~1 hour) - Pushed the weights & wrote the article Here are the model weights, code, and the blog it generated: hf article model weights Awesome stuff Aksel , looking forward to use this. 🔥

Mayank Pratap Singh

94,081 просмотров • 2 месяцев назад

I coded a Speech-to-Text model from scratch. 𝐇𝐞𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐛𝐥𝐨𝐠 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞: No APIs. No pre-trained models. Just PyTorch, an A100 GPU, and hours of debugging. This started months ago. I wanted to understand how machines hear. Not surface-level understanding. I wanted to build the whole thing myself. So I built it piece by piece: autoencoders, VAEs, VQ-VAEs, Residual Vector Quantization, and CTC loss. Each one took days to get right. Trained for 3 hours on 13,100 audio clips. Got complete garbage. Changed the tokenizer from BPE to character-level. Rechecked everything. Asked AVB who built STT models before. His answer: these models are tricky to train and need days of compute, not hours. Cut the dataset to 200 clips. After 2 hours, actual words appeared. Overfitted? Absolutely. But watching noise turn into recognizable English was satisfying. I have made a blog about this as well so you can learn about the same and my process - Audio fundamentals and waveform representation - Why attention breaks on raw audio - Convolutional downsampling - Transformer encoder with positional encoding - Vector Quantization, straight-through estimator, and RVQ - CTC loss and greedy decoding - Full training loop with VQ loss warmup - What went wrong and what finally worked Resources: - Blog: - Code: More Resoures CTC loss AVB videos SoundStream Paper LJ speech dataset wav2vec paper RVQ blog Next up: I've already trained two TTS architectures from scratch. Video post about those coming soon. But first, I'm dropping a visual breakdown of Vision Transformers, covering how they work and how to fine-tune them. Follow me Mayank Pratap Singh you're into audio deep learning. Repost so others can find this

I coded a Speech-to-Text model from scratch. 𝐇𝐞𝐫𝐞 𝐢𝐬 𝐭𝐡𝐞 𝐛𝐥𝐨𝐠 𝐟𝐨𝐫 𝐭𝐡𝐞 𝐬𝐚𝐦𝐞: No APIs. No pre-trained models. Just PyTorch, an A100 GPU, and hours of debugging. This started months ago. I wanted to understand how machines hear. Not surface-level understanding. I wanted to build the whole thing myself. So I built it piece by piece: autoencoders, VAEs, VQ-VAEs, Residual Vector Quantization, and CTC loss. Each one took days to get right. Trained for 3 hours on 13,100 audio clips. Got complete garbage. Changed the tokenizer from BPE to character-level. Rechecked everything. Asked AVB who built STT models before. His answer: these models are tricky to train and need days of compute, not hours. Cut the dataset to 200 clips. After 2 hours, actual words appeared. Overfitted? Absolutely. But watching noise turn into recognizable English was satisfying. I have made a blog about this as well so you can learn about the same and my process - Audio fundamentals and waveform representation - Why attention breaks on raw audio - Convolutional downsampling - Transformer encoder with positional encoding - Vector Quantization, straight-through estimator, and RVQ - CTC loss and greedy decoding - Full training loop with VQ loss warmup - What went wrong and what finally worked Resources: - Blog: - Code: More Resoures CTC loss AVB videos SoundStream Paper LJ speech dataset wav2vec paper RVQ blog Next up: I've already trained two TTS architectures from scratch. Video post about those coming soon. But first, I'm dropping a visual breakdown of Vision Transformers, covering how they work and how to fine-tune them. Follow me Mayank Pratap Singh you're into audio deep learning. Repost so others can find this

Mayank Pratap Singh

51,382 просмотров • 3 месяцев назад

I trained a 100 million parameter DeepSeek V3 LLM from scratch Here's what you need to know. Previously I trained traditional GPT-2 architecture which has become obsolete with recent LLM advancements. Most recent models like Llama, Mistral, DeepSeek, and GPT-4 use latest architectures. ✦ Model Configuration of my SLM DeepSeek V3 - Parameters: 109,032,032 - Embedding Dimension: 512 - Layers: 8 - Heads: 8 - Experts (MoE): 8 - Experts per token: 2 ✦ DeepSeek brings major architectural changes: - Multi Head Latent Attention - Mixture of Experts - RMS Norm - Multi Token Prediction ✦ Dataset Challenge - TinyStories is great for learning SLMs. I trained GPT-2 on it previously with good results. - But I needed a more challenging dataset. - If I use TinyStories again on DeepSeek, how would I know MHLA, MoE or MTP works better than old architecture? - The old architecture can handle it, so new DeepSeek would too without utilizing latest advancements. That's why I moved to FineWeb-Edu dataset Thanks Yuvraj Singh (smolhub.com) for the suggestion for this dataset ✦ Training Journey - Rented A100 PCIe GPU and trained the model. - Did test runs. During final run, model was 65% trained but stopped due to glitch after 4 hours. - Fixed all edge cases and ran training again with increased config parameters. - Final training: 7 hours, 20,000 epochs 𝐓𝐨𝐭𝐚𝐥 𝐆𝐏𝐔 𝐜𝐨𝐬𝐭: $17 - $9.53 for main 7-hour run - $7.42 for experiments and demos ✦ Reflection Amazing long project that taught me latest architectural advancements. I'll reimplement and revisit after a few weeks because there's too much complexity, mostly in Multi Head Latent Attention part. Need to make concepts stronger. Code Final trained Model Dataset Resources Huge shoutout to Raj Dandekar again for creating one of the most detailed video series about DeepSeek - this was my primary resource for the implementation. Playlist Blogs by Maarten Grootendorst These are excellent visual blogs to understand MoE in detail. Thanks Maarten for your amazing contributions to the community through your books and blogs Blogs on MoE Implemention of MoE from scratch by @aviTwit3 One of the most detailed blogs on implementing Mixture of Experts. Thanks Avinash for this blog - it helped me understand Mixture of Experts much better. If you're someone in the 𝐌𝐋 & 𝐋𝐋𝐌 space, would love to 𝐜𝐨𝐧𝐧𝐞𝐜𝐭 and discuss this field in general, so give a follow up for that.

I trained a 100 million parameter DeepSeek V3 LLM from scratch Here's what you need to know. Previously I trained traditional GPT-2 architecture which has become obsolete with recent LLM advancements. Most recent models like Llama, Mistral, DeepSeek, and GPT-4 use latest architectures. ✦ Model Configuration of my SLM DeepSeek V3 - Parameters: 109,032,032 - Embedding Dimension: 512 - Layers: 8 - Heads: 8 - Experts (MoE): 8 - Experts per token: 2 ✦ DeepSeek brings major architectural changes: - Multi Head Latent Attention - Mixture of Experts - RMS Norm - Multi Token Prediction ✦ Dataset Challenge - TinyStories is great for learning SLMs. I trained GPT-2 on it previously with good results. - But I needed a more challenging dataset. - If I use TinyStories again on DeepSeek, how would I know MHLA, MoE or MTP works better than old architecture? - The old architecture can handle it, so new DeepSeek would too without utilizing latest advancements. That's why I moved to FineWeb-Edu dataset Thanks Yuvraj Singh (smolhub.com) for the suggestion for this dataset ✦ Training Journey - Rented A100 PCIe GPU and trained the model. - Did test runs. During final run, model was 65% trained but stopped due to glitch after 4 hours. - Fixed all edge cases and ran training again with increased config parameters. - Final training: 7 hours, 20,000 epochs 𝐓𝐨𝐭𝐚𝐥 𝐆𝐏𝐔 𝐜𝐨𝐬𝐭: $17 - $9.53 for main 7-hour run - $7.42 for experiments and demos ✦ Reflection Amazing long project that taught me latest architectural advancements. I'll reimplement and revisit after a few weeks because there's too much complexity, mostly in Multi Head Latent Attention part. Need to make concepts stronger. Code Final trained Model Dataset Resources Huge shoutout to Raj Dandekar again for creating one of the most detailed video series about DeepSeek - this was my primary resource for the implementation. Playlist Blogs by Maarten Grootendorst These are excellent visual blogs to understand MoE in detail. Thanks Maarten for your amazing contributions to the community through your books and blogs Blogs on MoE Implemention of MoE from scratch by @aviTwit3 One of the most detailed blogs on implementing Mixture of Experts. Thanks Avinash for this blog - it helped me understand Mixture of Experts much better. If you're someone in the 𝐌𝐋 & 𝐋𝐋𝐌 space, would love to 𝐜𝐨𝐧𝐧𝐞𝐜𝐭 and discuss this field in general, so give a follow up for that.

Mayank Pratap Singh

48,005 просмотров • 11 месяцев назад

Больше нет контента для загрузки