Laguna XS.2 from Poolside is a 33B MoE built for agentic coding. Red Hat AI trained a DFlash speculator for it: 0.6B drafter, 8 tokens per pass, no quality loss. FP8, NVFP4, and INT4 checkpoints via LLM Compressor. Models in comments. Speedup with vLLM:

Red Hat AI

10,962 subscribers

20,411 views • 17 days ago • via X (Twitter)

Related videos

What compression looks like on vLLM. Same Gemma 4 31B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 Sawyer Bowerman for the 2-minute demo.

Red Hat AI

34,069 views • 2 months ago

"Why are you benchmarking DGX Spark? It's a training box."

Yeah. Low bandwidth, but 128GB of unified memory is just sitting there. Plenty of room to optimize.

DGX Spark + Qwen3.6 27B. Four backend/quant combos:

🔴 llama.cpp + UD_Q4_K_XL > 11.0 tok/s (baseline), TTFT 297ms
🟢 llama.cpp + DFlash > 20.4 tok/s (peaks at 97 tok/s), TTFT 320ms
🟡 vLLM FP8 + MTP > 13.1 tok/s, TTFT 540ms
🟣 vLLM NVFP4 + MTP > 24.2 tok/s, TTFT 376ms

NVFP4+MTP is the winner for me, rock stable around 24 tok/s, no wild swings. DFlash is the wildcard: massive peaks, but fluctuates a lot. FP8+MTP barely beats baseline, and it's FP8. Love my Spark.

stevibe

39,835 views • 1 month ago

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that replaces the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below. I'll soon publish another article on speculative decoding. Stay tuned!!

Akshay 🚀

66,052 views • 4 days ago
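
The draft-then-verify loop described in the post above can be sketched in a few lines. This is a minimal illustration under greedy decoding, not DFlash itself and not any library's API. `toy_target` and `toy_draft` are hypothetical stand-in next-token functions. The draft here is produced one token at a time, whereas a block drafter such as DFlash would emit the whole block in one parallel pass. The target is called once per position for readability, where a real system would score the whole block in a single forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """One draft-then-verify round; returns the tokens to append."""
    # 1) Draft k tokens with the cheap model.
    draft: List[Token] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) Verify against the target, left to right.
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep everything before it and take the
            # target's own token for this position.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # All k drafts matched, so the target contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted


def generate(prefix: List[Token], draft_next: NextToken,
             target_next: NextToken, k: int,
             max_new: int) -> Tuple[List[Token], int]:
    """Generate max_new tokens; also return how many rounds were needed."""
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[:len(prefix) + max_new], rounds


def greedy(prefix: List[Token], target_next: NextToken,
           max_new: int) -> List[Token]:
    """Plain one-token-at-a-time decoding with the target, for comparison."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target_next(out))
    return out


# Hypothetical toy models: the target counts upward modulo 10; the draft
# agrees with it except right after a 4, where it guesses wrongly.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10
```

Calling `generate([0], toy_draft, toy_target, k=8, max_new=20)` returns the same tokens as `greedy([0], toy_target, 20)` while using fewer verification rounds than tokens generated. Every round emits at least one token chosen by the target, which is why the output cannot differ from ordinary greedy decoding in this setup.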

Researchers found a way to make LLMs 8.5x faster! (without compromising accuracy)

Speculative decoding is quite an effective way to address the single-token bottleneck in traditional LLM inference. A small "draft" model first generates the next several tokens, then the large model verifies all of them at once in a single forward pass. If a token at any position is wrong, you keep everything before it and restart from there. This never does worse than normal decoding.

But current drafters in speculative decoding still guess one token at a time. That makes the drafting step itself a bottleneck, capping real-world speedups at 2-3x.

DFlash is a new technique that replaces the autoregressive drafter with a lightweight block diffusion model that guesses all tokens in one parallel shot. Drafting cost stays flat no matter how many tokens you speculate. On top of that, the drafter is conditioned on hidden features pulled from multiple layers of the target model and injected into every draft layer, so it makes significantly better guesses than a drafter working from scratch.

In the side-by-side demo below, vanilla decoding runs at 48.5 tokens/sec. DFlash hits 415 tokens/sec on the same model, with zero quality loss.

It's already integrated with vLLM, SGLang, and Transformers, with draft models on HuggingFace for several models like Qwen3, Qwen3.5, Llama 3.1, Kimi-K2.5, gpt-oss, and many more. I have shared the GitHub repo in the replies!

KV caching is another must-know technique to boost LLM inference. I recently wrote an article about it. Read it below.

👉 Over to you: What use case are you working on that can benefit from this new technique?

Avi Chawla

156,765 views • 1 month ago

DFlash speculative decoding on Apple Silicon

Qwen3.5-9B bf16 · M5 Max · greedy exact match

▸ 85 tok/s, 3.3× at 1024 tokens (runtime)
▸ ~70 tok/s, 2.6× in the video (terminal I/O overhead)
▸ 80 tok/s, 3.1× at 2048 tokens (runtime)

Currently working on:
→ Long context (speedup degrades past 4K tokens, KV cache growth)
→ Int4 quantized models (27B class)

Built on MLX, no CUDA, single machine. Draft generates 16 tokens in parallel, target verifies in one forward pass. Will open source when ready.

bstn 👁️

36,888 views • 2 months ago

The new 1 trillion parameter Kimi K2 Thinking model runs well on 2 M3 Ultras in its native format, with no loss in quality! The model was trained with quantization-aware training (QAT) at int4. Here it generated ~3500 tokens at 15 tok/s using pipeline parallelism in mlx-lm:

Awni Hannun

500,639 views • 7 months ago

DeepSeek R1 (the full 680B model) runs nicely in higher quality 4-bit on 3 M2 Ultras with MLX. Asked it a coding question and it thought for ~2k tokens and generated 3500 tokens overall:

Awni Hannun

997,195 views • 1 year ago

Finally got a chance to play around with Andrej Karpathy's LLM Council. I built it as a plugin inside of Claude Code. Hooked it up with OpenRouter for models. The AskUserQuestion tool came in handy to select the council and chairman. This is my first test, but I agree with Karpathy that the concept of LLM ensembles can be used beyond models that offer perspectives on interesting questions. I feel like this could have really cool applications in agentic coding. More on that soon. I built this as a plugin, so next I will be exploring other use cases around agentic coding, like evaluation, tool building, designing, and research. If there is enough interest, I will clean it up and push it out as an open plugin.

elvis

79,648 views • 5 months ago

8. DGX Spark = personal AI supercomputer

Imagine training your own LLM at your desk. DGX Spark is the dev-friendly AI rig built for workstation scale.

→ Shipping via ASUS, Dell, MSI
→ Trains large models locally
→ No server farm needed

Shruti

35,907 views • 1 year ago

you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide:

will brown

139,377 views • 3 months ago

Here’s how to build a Slack clone with Claude 3.7 Sonnet and Cursor Agent in v0.46. Perfect for *anyone* wanting to pick up AI coding tools. Learn how to build an app 100% from scratch with the newest AI coding workflows. Watch for 1hr+ of agentic coding.

Mckay Wrigley

212,403 views • 1 year ago

✨ Meet our new open family of models: NVIDIA Nemotron 3

Open in weights, data, tools, and training, Nemotron 3 is built for multi-agent apps and features:

• An efficient hybrid Mamba‑Transformer MoE architecture
• 1M token context for long-term memory and improved reasoning
• Multi‑environment reinforcement learning via NeMo Gym for advanced skill adaptation

Plus NVFP4 pre-training, latent MoE, 1T tokens of data, and more.

📗 Read the details in our tech blog:
🤗 Try the model on Hugging Face:

NVIDIA AI Developer

68,500 views • 6 months ago

Scale alone is not enough for AI data. Quality and complexity are equally critical. Excited to support all of these for LLM developers with Snorkel AI Data-as-a-Service, and to share our new leaderboard! — Our decade-plus of research and work in AI data has a simple point: scale alone is not enough. AI success is all about the quality, complexity, and distribution of data—in addition to volume. We’re excited to be powering leading LLM developers with Snorkel AI Expert Data-as-a-Service, our white glove service for custom, expert-level AI datasets—and to now preview some of what we’re building via our new Expert Data Leaderboard (🔗 in 🧵) + upcoming OSS dataset releases! Snorkel Expert Data-as-a-Service is built to meet the rapidly evolving data needs of the agentic AI world—where success is built on the quality, complexity, and distribution of datasets, in addition to size and scale. This kind of high-quality, frontier AI data can only come from a union of technology and human expertise. With Snorkel Expert Data-as-a-Service, we’re powering frontier LLM developers across agentic, expert knowledge, reasoning, coding, multi-modal, and other task types via the combination of these two key components: - (1) The Snorkel Expert Network: A global team of subject matter experts focused wholly on specialized knowledge–spanning thousands of topics in STEM/academic, vertical/professional, and consumer/lifestyle domains. - (2) Snorkel AI Data Development Platform: Our unique programmatic data curation and quality control platform, accelerating and improving expert authoring and review through principled techniques developed over the last decade of R&D. Now: we’re incredibly excited to showcase some of the power of Snorkel Expert Data-as-a-Service via the new Snorkel Leaderboard—putting frontier models to the test in complex, agentic, and reasoning settings inspired by real industry scenarios (not esoteric puzzles)! 
We’ll be releasing new leaderboards and accompanying expert-verified open source datasets (coming soon!) regularly. To start, we’re sharing three initial ones in preview: - SnorkelFinance: Q&A over financial documents requiring agentic tool-calling and reasoning - SnorkelUnderwrite: Agentic insurance tasks requiring industry-specific reasoning and tool use - SnorkelSequences: Mathematical tasks requiring compositional multi-step reasoning

Alex Ratner

495,823 views • 1 year ago

running Qwen3.5 397B MoE (17B active/token) on 4x DGX Sparks in FP8 (~400GB)

> OpenCode driving
> agent exploring its own config
> probing all 4 Sparks (via ssh) + reporting thermals
> inspecting how vLLM is serving it
> collecting + analyzing its own stats

local AI is awesome

Ahmad

121,691 views • 2 months ago
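
As a rough sanity check on the ~400GB figure in the post above, the weights-only footprint is the parameter count times the bits per parameter. This is a back-of-the-envelope sketch, assuming 1 GB = 10^9 bytes and ignoring quantization scales, KV cache, and activations. The 128GB per DGX Spark figure comes from the benchmark post earlier on this page, and `weight_memory_gb` is a hypothetical helper name.

```python
def weight_memory_gb(params_billion: float, bits_per_param: float) -> float:
    """Weights-only footprint in GB, taking 1 GB as 1e9 bytes."""
    return params_billion * bits_per_param / 8


fp8_gb = weight_memory_gb(397, 8)       # FP8: one byte per parameter
four_bit_gb = weight_memory_gb(397, 4)  # 4-bit formats: half a byte per parameter
cluster_gb = 4 * 128                    # four DGX Sparks, 128GB unified memory each

print(f"FP8 weights:   {fp8_gb:.1f} GB")
print(f"4-bit weights: {four_bit_gb:.1f} GB")
print(f"Headroom at FP8 across 4 Sparks: {cluster_gb - fp8_gb:.1f} GB")
```

Under these assumptions the FP8 weights alone take about 397 GB of the 512 GB available, which matches the post's ~400GB and leaves the remainder for KV cache and runtime overhead.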

Kimi K2 Thinking is a bigger deal than I thought! I just ran a quick eval on a deep agent I built for customer support. It's on par with GPT-5; no other LLM has reached this level of agentic, orchestration, and reasoning capabilities. Huge for agentic and reasoning tasks.

elvis

228,508 views • 7 months ago

🚨 Big news for AI innovation: Claude Opus 4 and Claude Sonnet 4, Anthropic's most advanced models, are now available in Amazon Bedrock. These powerful models offer hybrid reasoning, 200K token context windows, and are designed for AI agents. From financial analysis to high-quality writing, to enhanced reasoning, coding, agentic capabilities and more—all with the enterprise-grade security of Amazon Web Services.

Amazon

123,024 views • 1 year ago

STRIPE NOW LETS YOU PICK YOUR AI MODELS, SET YOUR MARKUP, AND BILL CUSTOMERS FOR LLM TOKENS AUTOMATICALLY. BUILDING AN AI WRAPPER BUSINESS JUST GOT A LOT EASIER.

0xMarioNawfal

421,712 views • 3 months ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

Red Hat AI

15,864 views • 12 days ago

This is it for AI coding.. I made this recipe app with a built-in TikTok feed with just 2 lines of prompts. And anyone (yes, anyone) can do the same. I will show you how (Bookmark for later):

Rez Karim

31,353 views • 11 months ago

Long video generation is a systems problem. Introducing LongLive-2.0 from NVIDIA Research: an end-to-end NVFP4 training and inference system for long video generation. Low-precision deployment often relies on post-training quantization, creating a gap between how models are trained and how they run. LongLive-2.0 aligns NVFP4-aware training, distillation, and W4A4 inference, maintaining strong benchmark quality while improving speed and memory efficiency.

NVIDIA AI

60,291 views • 24 days ago