introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)

performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)

naklecha

16,895 subscribers

59,730 views • 5 months ago • via X (Twitter)
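
The "on par" claim in the post above can be made concrete by computing the relative difference at each batch size. This is a minimal sketch that uses only the four throughput figures quoted in the post; the function and variable names are illustrative and are not part of simple-llm or vLLM.

```python
# Throughput figures quoted in the post, in tokens per second.
# Keys are batch sizes; names are illustrative only.
results = {
    1: {"simple_llm": 135, "vllm": 138},
    64: {"simple_llm": 4041, "vllm": 3846},
}


def relative_diff_pct(candidate: float, baseline: float) -> float:
    """Return how far candidate is above (+) or below (-) baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


for batch, r in results.items():
    diff = relative_diff_pct(r["simple_llm"], r["vllm"])
    print(f"batch={batch}: {diff:+.1f}% vs vLLM")
```

On these numbers simple-llm is about 2% behind vLLM at batch=1 and about 5% ahead at batch=64.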

Related videos

"Why are you benchmarking DGX Spark? It's a training box." Yeah. Low bandwidth, but 128GB of unified memory is just sitting there. Plenty of room to optimize.

DGX Spark + Qwen3.6 27B. Four backend/quant combos:
🔴 llama.cpp + UD_Q4_K_XL > 11.0 tok/s (baseline), TTFT 297ms
🟢 llama.cpp + DFlash > 20.4 tok/s (peaks at 97 tok/s), TTFT 320ms
🟡 vLLM FP8 + MTP > 13.1 tok/s, TTFT 540ms
🟣 vLLM NVFP4 + MTP > 24.2 tok/s, TTFT 376ms

NVFP4+MTP is the winner for me, rock stable around 24 tok/s, no wild swings. DFlash is the wildcard: massive peaks, but fluctuates a lot. FP8+MTP barely beats baseline, and it's FP8. Love my Spark.

stevibe

39,835 views • 1 month ago

4×5090 lost badly in my last Ollama test. But with vLLM + Nemotron 3 Super NVFP4, it came back swinging.

4×5090
- TTFT: 0.281s
- Prompt eval: 608 tok/s
- Generate eval: 110 tok/s

PRO 6000
- TTFT: 0.123s
- Prompt eval: 1390 tok/s
- Generate eval: 81 tok/s

So who wins?
- PRO 6000 is still faster to respond
- But 4×5090 now wins on raw generation speed

stevibe

16,974 views • 3 months ago

vLLM fast inference running on monte carlo synthetic data generation with:
> peak generation throughput of ~23k tokens/s & avg of 20k tokens/s
> ~200 reqs/s
> Qwen/Qwen2.5-0.5B-Instruct
> on 1x Grace Hopper 200 - 480 GB (~96GB HBM3)

vllm config:
--max-num-seqs 512
--enable-chunked-prefill (for better throughput)
--dtype float16 (half precision for mem efficiency)
--gpu-memory-utilization 0.95 (for better KV caching)

Archie Sengupta

26,130 views • 5 months ago
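
The vLLM settings listed in the post above could be assembled into a launch command roughly as follows. This is a sketch under assumptions: the post does not say how vLLM was started, so the `vllm serve` entry point is a guess, and the chunked-prefill option is spelled `--enable-chunked-prefill`, as in vLLM's CLI. The script only builds and prints the command. It does not launch a server.

```python
import shlex

# Model name taken from the post.
model = "Qwen/Qwen2.5-0.5B-Instruct"

# Assumption: the `vllm serve` entry point; the post does not state
# whether the server CLI or the offline Python API was used.
args = [
    "vllm", "serve", model,
    "--max-num-seqs", "512",              # cap on concurrently scheduled sequences
    "--enable-chunked-prefill",           # split long prefills into chunks
    "--dtype", "float16",                 # half-precision weights and activations
    "--gpu-memory-utilization", "0.95",   # fraction of GPU memory vLLM may claim
]

# shlex.join quotes each argument safely for a POSIX shell.
command = shlex.join(args)
print(command)
```

Keeping the arguments as a list makes it easy to vary one value at a time, for example `--max-num-seqs`, when comparing throughput between runs.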

Dozens of teams have asked my advice on running LLMs. How fast is DeepSeek V3 with vLLM on 8 GPUs? What's the max throughput of Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow So we built a tiny app, the LLM Engine Advisor

Charles 🎉 Frye

86,360 views • 1 year ago

The new gpt-oss-120b from OpenAI is now the best general purpose model to run on a Framework Desktop! ~40 tok/s on the MXFP4 version in LM Studio on Fedora 42.

Framework

129,524 views • 10 months ago

gpt-oss 120b running at 50 tokens/s locally on my macbook

ADAM

81,339 views • 10 months ago

Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on vLLM GB10 @ 74W The future isn't 10,000 GPUs in a nuclear-powered data center. It’s 10 agents on your desk solving your problems while you make your coffee.

Mitko Vasilev

145,938 views • 1 month ago

A full year of vLLM in 30 minutes by vLLM Lead from UC Berkeley, Simon Mo. Model and hardware usage trends, model architectures, API evolution, V1 engine rebuild, multimodal progress, expanding hardware support, and more. Plus how we are thinking about 2026. Enjoy!

Red Hat AI

15,697 views • 5 months ago

Today we announced a $150M seed round in Inferact, a new startup led by the maintainers of the vLLM project. Inferact will support the vLLM open source project through dedicated financial and developer resources and build what they see as the next generation commercial inference engine.

Cofounders Simon Mo and Woosuk Kwon joined a16z GP Matt Bornstein for a conversation on how vLLM came to be, what they’ve learned from building it, and what they’re now doing at Inferact. They cover how vLLM began as a side project, why open source is essential to the world’s AI infrastructure, why inference is getting harder, and more.

00:00 Introduction
11:41 Community and collaboration in vLLM
19:19 Understanding inference engines
24:27 Cluster scale and GPU deployment
31:19 Belief in open source AI
35:45 Founding Inferact
40:00 The future of Inference at scale

Simon Mo, Woosuk Kwon, Matt Bornstein

a16z

87,993 views • 4 months ago

vllm-studio > claude-desktop officially confirmed

0xSero

23,270 views • 1 month ago

InferenceMAX, vLLM TPU, compressed-tensors, MoE support via transformers, DeepSeek-OCR, and more. Here’s what’s new in the vLLM community over the past two weeks:

Red Hat AI

24,429 views • 7 months ago

Someone built a free and better alternative to Claude that runs 100% locally.
→ works with any LLM (Claude, GPT, Gemini, vLLM)
→ beats it on deep research
→ has Cowork-like capabilities
→ 50+ connectors out of the box
→ deploy in literally one command

100% open source. MIT license. 28k stars.

How To Prompt

39,043 views • 1 month ago

DeepSeek's latest (MLA, Multi-Token Prediction, 256 Experts, FP8 block quantization) shines with vLLM. Catch the office hours session where we discuss all the DeepSeek goodies and explore their integration and benchmarks with #vLLM.

Red Hat AI

14,093 views • 1 year ago

Qwen3.6 35B-A3B dropped yesterday, so I ran it on 4 GPUs to see how it performs:
🟣 RTX 3090 — 49.78 tok/s, TTFT 852ms
🟡 RTX 4090 — 118.93 tok/s, TTFT 686ms
🟢 RTX 5090 — 160.37 tok/s, TTFT 409ms
🔵 DGX Spark — 59.98 tok/s, TTFT 228ms

I went with ollama as the backend because honestly, it's the easiest way for most people to get started. One command, model pulled, done.

I used Q4_K_M (24GB) across all four cards. The reason is the 3090 and 4090 don't support NVFP4 (only the 5090 and DGX Spark could use it). Keeping the same quant everywhere felt like the fairest way to compare.

And yes, you can absolutely squeeze more performance out of every card with vLLM, SGLang, or TensorRT-LLM. But that's not what this test is about. This is just the out-of-the-box experience for folks who own a GPU and want to try the new model tonight.

stevibe

390,280 views • 2 months ago

you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide:

will brown

139,377 views • 3 months ago

We co-led Inferact's $150M seed round to support them in their mission to build the inference engine for all current and future AI.

In this episode of The Investment Memo, Lightspeed's Bucky Moore and James Alcorn sit down with Simon Mo (Co-Founder & CEO Inferact) to cover:
- How vLLM grew to 60K+ GitHub stars
- Why inference is shifting to the majority of compute
- How vLLM evolved from a research project into the industry standard
- Why building a company was the next step to push open-source inference forward

00:00 Introduction
02:03 The investment memo
04:47 Latency vs throughput vs cost
06:19 Paged attention explained
08:04 The evolution of attention
09:42 Growing the vLLM open source community
11:41 Working with hardware vendors
14:45 Deploying vLLM at large scale
16:03 Inferact's culture of openness
18:45 Building an open ecosystem and horizontal stack
19:45 Inferact's approach to fundraising
22:14 What is the future of inference?

Simon Mo, Bucky Moore, James Alcorn

Lightspeed

27,040 views • 4 months ago

Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total. a 21% throughput gain on a six-year-old MacBook Pro M1 Max!

Ronald Mannak

48,116 views • 2 months ago
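
The 21% figure in the post above follows from its per-stream numbers. A small sketch of the arithmetic, using only the values quoted in the post (variable names are illustrative):

```python
# Figures quoted in the post, in tokens per second.
single_stream_tok_s = 43.0   # one stream, no concurrency
per_stream_tok_s = 26.0      # each of two concurrent streams
streams = 2

# Aggregate throughput under continuous batching.
total_tok_s = per_stream_tok_s * streams

# Throughput gain relative to the single-stream case, in percent.
gain_pct = (total_tok_s - single_stream_tok_s) / single_stream_tok_s * 100.0

# Per-stream speed each user sees, as a percentage of single-stream speed.
per_stream_share_pct = per_stream_tok_s / single_stream_tok_s * 100.0

print(f"total: {total_tok_s:.0f} tok/s, gain: {gain_pct:.0f}%")
print(f"each stream runs at {per_stream_share_pct:.0f}% of single-stream speed")
```

Total throughput rises while each individual stream slows down, which is the usual trade-off when batching concurrent requests.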

Gemma4 12B with Unsloth's Quant on DGX Spark

Quants:
- UD_Q4_K_XL
- UD_Q5_K_XL
- UD_Q6_K_XL
- UD_Q8_K_XL

Summary:
- Q4: 25.21 tok/s, TTFT 168ms
- Q5: 21.7 tok/s, TTFT 182ms
- Q6: 17.68 tok/s, TTFT 193.95ms
- Q8: 15.22 tok/s, TTFT 221ms

stevibe

18,496 views • 13 days ago

DeepSeek-V4-Flash-Spark and Spark-Mini were born today. 1 command setup to Pi / vllm-studio / docker deployment. All this is running local on a single DGX Spark.

0xSero

20,019 views • 20 days ago

🎉 Day-0 support for in vLLM, available today in v0.23.0! Congrats to Z.ai on GLM-5.2, a flagship model built for long-horizon coding agents. ✨ 1M-token context, built to hold project-scale engineering work in a single run ✨ Tuned for long-horizon coding: large-scale implementation, automated research, and performance optimization ✨ One task can carry a full dev workflow, from requirements to a deployable product across platforms ✨ Client-side and mobile engineering, including an on-device debugging loop Try it out running it on vLLM today: 🔗

vLLM

24,391 views • 19 hours ago