I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture and BPE tokenizer from scratch.
The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x... increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform
Try out the model for yourself:
Built with Reese Chong. Check out the repos and blog if you want to learn more. Shoutout to Modal for the compute credits allowing me to train on 2 A100 GPUs without going broke. cc sunny madra, Gavin

Aadi Kulshrestha

3,427 subscribers

809,167 views • 1 month ago • via X (Twitter)

Education, Science and Technology
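The post mentions a BPE tokenizer written from scratch. As a rough illustration of the general byte-pair-encoding training loop (not the author's Rust implementation, which is not shown here), the sketch below counts adjacent symbol pairs and repeatedly merges the most frequent one. The function names `count_pairs`, `merge_pair`, and `train_bpe`, the character-level starting vocabulary, and the tie-breaking rule are all assumptions made for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs; `words` maps a tuple of symbols to its frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Start from single characters (an assumption; byte-level BPE starts from bytes).
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        # Most frequent pair wins; ties go to the lexicographically smallest pair.
        best = min(pairs, key=lambda p: (-pairs[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words


merges, vocab = train_bpe("low low lower lowest", num_merges=2)
```

On this toy corpus the two learned merges build the symbol `low`, so the shared stem becomes a single token. A production tokenizer would also need an encode step that replays the merges on new text.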

Related videos

I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory.
5 custom cuTile CUDA kernels ft:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA loads
Try it out:
s/o Bryce, the CUDA Colonel and the cuTile team at NVIDIA for lending me Blackwell GPU access :) cc sunny madra, Gavin

ani

806,064 views • 2 months ago
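The kernel list above includes online softmax, the trick that lets fused attention process keys in a stream without first materialising every score. The sketch below shows the general idea for a single query row in plain Python; it is not the author's cuTile kernel, and the names `streaming_attention_row` and `reference_attention_row` are made up for this example.

```python
import math


def reference_attention_row(scores, values):
    """Two-pass baseline: softmax over all scores, then a weighted sum of values."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


def streaming_attention_row(scores, values):
    """One pass: keep a running max, a rescaled sum, and a rescaled output accumulator."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Everything accumulated so far was scaled by exp(-running_max);
        # bring it to the new max before adding the new term.
        rescale = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        running_sum = running_sum * rescale + weight
        acc = [a * rescale + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


scores = [1.0, 3.0, -2.0, 0.5]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 4.0]]
streamed = streaming_attention_row(scores, values)
baseline = reference_attention_row(scores, values)
```

Both functions should agree up to floating-point rounding. The streaming form only ever holds one score and one value row at a time, which is what makes tiling over long sequences possible on a GPU.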

Check out mistral.rs, our #Rust-based open source inference engine allowing for fast #LLM serving for a variety of architectures including X-LoRA mixture-of-expert (MoE) models, Llama-3, Mistral/Mixtral, Gemma & many others. Built on the Hugging Face #Candle framework for #Rust w/ custom CUDA kernels in the backend (as well as support for Metal, Apple Accelerate, and Intel MKL for CPU use), you can easily create a REST API OpenAI compatible server or run via Python bindings.
Key features include:
✅ Prefix caching, continuous batching
✅ Flash Attention V2
✅ Device offloading
✅ GGUF or Hugging Face models
✅ 2, 3, 4, 5, 6 and 8 bit quantization
✅ X-LoRA MoE non-granular scalings for fast inference
✅ Grammar support
✅ Continuous batching
✅ LoRA support with weight merging
✅ LlamaIndex 🦙 integration
...and much more.
Incorporation into our GraphReasoning multi-agent modeling framework & LlamaIndex 🦙 allows you to combine in-context learning with adversarial agentic strategies, to dive deep into complex scientific analyses, such as to predict material behaviors, generate hypotheses, analyze papers and data, develop new research concepts, and much more.
Check out mistral.rs:
Join our Discord here:

Markus J. Buehler

73,575 views • 2 years ago

The latest MLX has a CUDA back-end! To get started: pip install "mlx[cuda]" With the same codebase you can develop locally, run your model on Apple silicon, or in the cloud on Nvidia GPUs. MLX is designed around Apple silicon - which has a unified memory architecture. It uses the same design with CUDA. So there's no need to move arrays around from CPU memory to GPU memory. Note, this is early days - some operations are missing and performance is still being optimized. But it's already quite fast for Transformer training, text generation, and more! Here's a demo using mlx-lm to generate text with Llama 3 8B (bf16) on an A100:

Awni Hannun

42,751 views • 10 months ago

Just launched #CES2026, the new open-source NVIDIA Nemotron Speech ASR model is here to solve latency drift and redundant compute. Its cache-aware streaming architecture eliminates the need for buffered inference, giving you stable, sub-100ms latency (24ms median T-T-F) and up to 3x more throughput on your GPU. 🤗 Read the technical blog with real-world results from Daily and Modal on Hugging Face:

NVIDIA AI Developer

138,370 views • 5 months ago

Congrats to the Kimi.ai team! This is awesome. Great to see this level of research coming from open-source frontier model labs. I liked the paper so much I built a Rust implementation of it ;) Full AttnRes + Block AttnRes with two-phase inference, built using Burn (tensor library and Deep Learning Framework, in Rust, by Tracel AI). Runs on CPU, CUDA, Metal, wgpu. Includes an interactive TUI that trains a model live and visualizes depth attention evolving from uniform to selective in real time. Repo link and more on what is implemented in the comments.

abdel

94,633 views • 3 months ago

Ubuntu 26.04 LTS, codenamed #ResoluteRaccoon, is now available to download. 🦝 Resolute Raccoon builds on the resilience-focused improvements introduced in interim releases, with TPM-backed full-disk encryption, improved support for application permission prompting, Livepatch updates for Arm-based servers, and Rust-based utilities for enhanced memory safety. This release also brings native support for industry-leading AI/ML toolkits like NVIDIA CUDA and AMD ROCm, making Ubuntu 26.04 LTS the ideal platform for AI development and production workloads. Install now: Learn more about the release:

Ubuntu

651,348 views • 1 month ago

Introducing SubQ - a major breakthrough in LLM intelligence. It is the first model built on a fully sub-quadratic sparse-attention architecture (SSA), and the first frontier model with a 12 million token context window, which is:
- 52x faster than FlashAttention at 1MM tokens
- Less than 5% the cost of Opus
Transformer-based LLMs waste compute by processing every possible relationship between words (standard attention). Only a small fraction actually matter. Subquadratic finds and focuses only on the ones that do. That's nearly 1,000x less compute and a new way for LLMs to scale.

Alexander Whedon

12,821,247 views • 1 month ago

Luminal is creating PyTorch for Production – an ML compiler that generates blazingly fast CUDA kernels and makes deploying to production one line of code. Congrats on the launch, Jake Stevens, Joe Fioti, and Matthew Gunton!

Y Combinator

98,496 views • 11 months ago

🎉 Stoked to share The AI CUDA Engineer 👷 - our end-to-end approach for automating the design and optimization of CUDA Kernels using agentic systems. Blog 📰: Paper 📜: WebUI 📈: Dataset 💽: Awesome team work done with Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru 🤗

Robert Lange

42,174 views • 1 year ago

YOUR PARENTS PAID FOR THE CUDA MOAT! The #1 contributor to the CUDA MOAT isn't the developers at NVIDIA, but the millions of developers outside of NVIDIA who invent new algorithms for CUDA, like Flash Attention. For most of them, it started with a GeForce gaming GPU. NVIDIA is the only company that has a reasonably good developer stack on consumer-grade GPUs. As people grow up beyond playing CSGO & League of Legends & Minecraft, they either become anime weeaboos or they start programming on their existing computer, which has a GeForce GPU

SemiAnalysis

25,230 views • 2 months ago

Finally, I'm done with the basics of backend engineering on the all-in-one resources to learn backend engineering I'm working on This basic section includes - Internet - HTTP - Servers - Web Dev fundamentals - Fundamentals of Operating Systems - A server-side language -> JavaScript | Go | Rust | Node - A framework -> Express | Nestjs | Django | Laravel | Spring Boot This is the boring part. Don't worry. We are getting into the fun part soon. Still working on it, updating it, Adding more, and making it better for you. Follow me. Retweet. Comment "Backend." I will DM you the PDF version when it's out. Enjoy the web version, and send in your feedback on how we can improve it for everyone.

Solomon Eseme

92,180 views • 2 years ago

GeoConfirmed News! Our amazing IT team has launched the GeoConfirmed 2.0 website! The new platform offers many more features for every investigator working with our data. What’s new? Find out in our first video below... and visit to try it out yourself. THANK YOU FOR YOUR ATTENTION TO THIS MATTER 😉 (With special thanks to Ukraine Control Map and Unit Observer for their very important work)

GeoConfirmed

33,490 views • 2 months ago

Mixtral 8x7B Instruct with AWQ & Flash Attention 2 🔥 All in ~24GB GPU VRAM! With the latest release of AutoAWQ, you can now run Mixtral 8x7B MoE with Flash Attention 2 for blazingly fast inference. All in < 10 lines of code. The only real change, apart from loading AWQ weights, is to pass attn_implementation="flash_attention_2" to the .from_pretrained call when loading the model. Here's a full run-through:
1. Install AutoAWQ and transformers
    pip install autoawq git+ com/huggingface/transformers.git
2. Initialise the tokeniser and the model
    from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
    model_id = "casperhansen/mixtral-instruct-awq"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        low_cpu_mem_usage=True,
        device_map="cuda:0",
        attn_implementation="flash_attention_2",
    )
3. Initialise the TextStreamer
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
4. Tokenise the inputs
    # `text` is the prompt string; the post does not define it, so set it beforehand.
    tokens = tokenizer(text, return_tensors="pt").input_ids.to("cuda:0")
5. Generate!
    generation_output = model.generate(tokens, streamer=streamer, max_new_tokens=512)
That's it! 🤗

Vaibhav (VB) Srivastav

128,893 views • 2 years ago

This is my journey, I'm 4 years on E now and I changed a lot and I still want to change more for better, thank you for being with me these last 2 years that I've been on the platform, I want to make so much more for me and for you! Thank you and happy new year!

Amy Bunny

25,862 views • 1 year ago

Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels. The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production. We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient! We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer. Paper: Kernel Archive Webpage: HuggingFace Dataset: The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations! We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization. We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.

Sakana AI

1,149,339 views • 1 year ago

🎉 "This is the 20th anniversary of CUDA. We have been working on this architecture for 20 years ... to now have built up hundreds of millions of GPUs and computing systems around the world that run CUDA."

NVIDIA HPC Developer

27,642 views • 3 months ago

Autonomys is a purpose-built blockchain for AI infrastructure—a vertically integrated stack for decentralized storage and compute. ⛓️ Consensus: Built with the Substrate framework (sovereign L1, not part of Polkadot), powered by Proof-of-Archival-Storage. 🟦 Domains Architecture: Our modular execution layer for compute. The first Domain, Auto EVM, launched in July—permissionless smart contracts are live today. 🗺️ Roadmap: Permissionless instantiation of Domains, allowing developers to launch custom environments tailored to specific AI functions. Designed before AI became fashionable, Autonomys delivers the foundation that decentralized AI truly needs. 👉 Learn more:

Autonomys | AI3.0

150,240 views • 8 months ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework.
On a 4-token prompt with 252 generated tokens:
- Original: 0.76 tok/s
- KV cache fp32: 27.21 tok/s
- KV cache int8 (quantized): 27.29 tok/s
Try it out yourself here:
In practice:
- KV caching gave us about a 35x end-to-end speedup
- INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x
The FP32 cache used 4.5 MB in this run, while the INT8 cache used only 1.19 MB. This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at

Reese Chong

52,511 views • 1 month ago
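To make the numbers in the post above concrete, here is a small Python sketch of one common way to quantize a KV cache row to INT8 (symmetric, one float scale per row), together with the ratios implied by the quoted figures. The post does not say which quantization scheme its Rust + CUDA code uses, so the per-row scheme, the 64-value row length, and the names `quantize_int8`, `dequantize_int8`, and `cache_bytes` are all assumptions.

```python
def quantize_int8(row):
    """Symmetric per-row quantization: returns (scale, codes in [-127, 127])."""
    max_abs = max((abs(x) for x in row), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in row]
    return scale, codes


def dequantize_int8(scale, codes):
    """Recover approximate float values from the codes and their shared scale."""
    return [scale * c for c in codes]


def cache_bytes(num_rows, row_len, bytes_per_value, scale_bytes_per_row=0):
    """Storage for `num_rows` cached rows, optionally with one scale per row."""
    return num_rows * (row_len * bytes_per_value + scale_bytes_per_row)


# Ratios implied by the figures quoted in the post.
reported_memory_ratio = 4.5 / 1.19   # FP32 cache MB / INT8 cache MB
reported_speedup = 27.21 / 0.76      # cached tok/s / original tok/s

# Hypothetical layout: 64-value rows, each INT8 row carrying one 4-byte scale.
fp32_bytes = cache_bytes(num_rows=1024, row_len=64, bytes_per_value=4)
int8_bytes = cache_bytes(num_rows=1024, row_len=64, bytes_per_value=1,
                         scale_bytes_per_row=4)
sketch_memory_ratio = fp32_bytes / int8_bytes

scale, codes = quantize_int8([0.5, -1.27, 0.0, 1.0])
restored = dequantize_int8(scale, codes)
```

The quoted figures work out to roughly 3.78x for memory and roughly 35.8x for throughput. Storing a scale next to each row is one plausible reason an INT8 cache lands somewhat below the ideal 4x saving over FP32, though the actual overhead in the author's implementation is not stated.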

“There are more companies than ideas by quite a bit. Compute is large enough such that it's not obvious that you need that much more compute to prove some idea. AlexNet was built on 2 GPUs. The transformer was built on 8 to 64 GPUs. Which would be, what, 2 GPUs of today? You could argue that o1 reasoning was not the most compute heavy thing in the world. For research, you definitely need some amount of compute, but it's far from obvious that you need the absolutely largest amount of compute. If everyone is within the same paradigm, then compute becomes one of the big differentiators.” Ilya Sutskever

Dwarkesh Patel

203,605 views • 6 months ago

good morning /v1/chat/completions This is a test we ran overnight on TensorRT-LLM with modified kernels serving a custom 1B parameter model we trained for a customer ~200ms end-to-end latency (not TTFB, full request). Beats their current Cerebras stack on latency and quality

Sam Hogan 🇺🇸

43,655 views • 4 months ago