I implemented Google Research's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels featuring:
- fused attention (with QJL corrections)
- online softmax
- on-chip cache decompression
- pipelined TMA...
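The post does not show any kernel code. As a rough illustration of the online softmax technique it names, here is a minimal pure-Python sketch for a single query. The function names are hypothetical, and the sketch does not model TurboQuant, the QJL corrections, cache decompression, TMA pipelining, or cuTile; it only shows how a running maximum and running sum let attention be computed in one pass over the keys.

```python
import math

def online_softmax_attention(scores, values):
    """One-pass softmax-weighted sum of `values` for a single query.

    scores: list of finite floats (attention logits), one per key
    values: list of equal-length lists of floats, one per key
    Hypothetical helper for illustration; not the author's kernel.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        new_max = max(running_max, s)
        # Rescale everything accumulated so far to the new maximum.
        scale = math.exp(running_max - new_max)
        weight = math.exp(s - new_max)
        running_sum = running_sum * scale + weight
        acc = [a * scale + weight * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

def reference_attention(scores, values):
    """Conventional two-pass softmax attention, used only as a check."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(dim)]
```

The one-pass form is what makes fusion possible in principle: the full row of scores never has to be materialized, so each block of keys and values can be consumed as soon as it is loaded.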

805,934 views • 2 months ago • via X (Twitter)

Related videos

Introducing The AI CUDA Engineer: An agentic AI system that automates the production of highly optimized CUDA kernels.

The AI CUDA Engineer can produce highly optimized CUDA kernels, reaching 10-100x speedup over common machine learning operations in PyTorch. Our system is also able to produce highly optimized CUDA kernels that are much faster than existing CUDA kernels commonly used in production.

We believe that fundamentally, AI systems can and should be as resource-efficient as the human brain, and that the best path to achieve this efficiency is to use AI to make AI more efficient!

We are excited to publish our paper, The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition. We also release a dataset of over 17,000 verified CUDA kernels produced by The AI CUDA Engineer.

Paper: Kernel Archive Webpage: HuggingFace Dataset:

The AI CUDA Engineer utilizes evolutionary LLM-driven code optimization to autonomously improve the runtime of machine learning operations. Our system is not only able to convert PyTorch code into CUDA kernels, but through the use of evolution, it can also optimize the runtime performance of CUDA kernels, fuse multiple operations, and even discover novel solutions for writing efficient CUDA operations by learning from past innovations!

We believe The AI CUDA Engineer opens a new era of AI-driven acceleration of AI and automated inference time optimization. We (Robert Lange, Aaditya Prasad 🇺🇸, Suuun, Maxence Faldor, Yujin Tang, hardmaru) are excited to continue Sakana AI's mission of leveraging AI to improve AI.

Sakana AI

1,149,339 views • 1 year ago

WHALE REPORT! 🐳🐳🐳🐳🐳🐳🐳🐳🐳🐳🐳 15 hours of research. JENSEN OF NVIDIA WAS RIGHT!

DeepSeek-V4 introduces several architectural and practical innovations that make it stand out, especially for long-context agentic workloads. The most geopolitically and technically notable aspect is its strong push toward hardware independence from NVIDIA’s CUDA ecosystem.

The Big Story: Reduced/Limited Reliance on CUDA

One of the most discussed “new things” is DeepSeek’s deliberate move toward CUDA independence (or at least strong dual-stack capability), driven by export controls and a push for sovereign AI stacks in China.

• Training & Inference on Huawei Ascend: Huawei confirmed full support for V4 inference on its Ascend supernodes (e.g., Ascend 950/910 series) via CANN (Compute Architecture for Neural Networks), Huawei’s CUDA equivalent. DeepSeek co-optimized kernels directly with Huawei. Reports indicate fine-grained Expert Parallelism validated on Ascend NPUs, with speedups of 1.5–1.73x on non-NVIDIA platforms.
• CANN Migration: The model (or key parts of its stack) was adapted/rewritten for CANN. This involved significant engineering effort, contributing to release delays. It represents a “full shift” or major de-NVIDIA-ization in the inference path for domestic deployments, proving frontier-scale MoE can run efficiently on alternative hardware.
• Still Some CUDA Compatibility: Open-source components like the MegaMoE mega-kernel in DeepGEMM remain CUDA-based for NVIDIA users (with strong performance on Hopper/Blackwell). vLLM recipes support NVIDIA H200/B200 etc., and the weights run on CUDA setups. Earlier DeepSeek work (e.g., on V3) used low-level PTX to bypass some CUDA limitations for communication-heavy MoE. V4 builds on this hybrid pragmatism but emphasizes CANN-native paths.
• Broader Implications: This validates a dual (or alternative) ecosystem. It reduces reliance on restricted NVIDIA tech for training/inference in China, potentially lowering costs and increasing resilience. Analysts see it as a blueprint for sovereign AI and a challenge to the “CUDA moat.”

Other notables include three-tier reasoning modes (Non-think / Think High / Think Max), Muon optimizer usage, heavy data curation (32T+ tokens, removing synthetic content), and MIT licensing for broad openness.

In short, V4 isn’t just another bigger MoE: its efficiency tricks for usable million-token context, agent optimizations, and hardware-stack diversification (especially the CANN emphasis) mark it as a pragmatic leap for open, cost-effective, long-horizon AI. The CUDA/CANN angle is particularly significant in the current geopolitical context, showing how architectural and software innovations can mitigate hardware access barriers.

Weights are on Hugging Face; API is live with competitive pricing. Quantized versions and community integrations are being released now.

Last week Jensen Huang of Nvidia talked about why THE US MUST LEAD OPEN SOURCE. Welp, we found out. Listen closely to this interview and the lack of knowledge the interviewer has. It is emblematic of the points of view permeating large AI companies. Perhaps you can hear Jensen now?

Brian Roemmele

70,873 views • 1 month ago