introducing simple-llm: a ~950 line, powerful & extensible inference engine that performs on par with vllm. enjoy :)
performance (gpt-oss-120b, on an h100):
- batch=1: 135 tok/s (vllm: 138)
- batch=64: 4,041 tok/s (vllm: 3,846)

naklecha

16,895 subscribers

59,730 views • 5 months ago • via X (Twitter)
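
To put the "on par with vllm" claim in numbers, the quoted figures can be compared directly. This is a minimal sketch that uses only the four throughput values from the post. The helper names are hypothetical, and the per-sequence figure assumes the batch=64 number is aggregate throughput across the whole batch, which the post does not state explicitly.

```python
# Throughput figures quoted in the post (gpt-oss-120b on an H100), in tok/s.
RESULTS = {
    1: {"simple-llm": 135.0, "vllm": 138.0},
    64: {"simple-llm": 4041.0, "vllm": 3846.0},
}


def relative_gap_percent(candidate: float, baseline: float) -> float:
    """Candidate throughput relative to the baseline, in percent (negative = slower)."""
    return (candidate - baseline) / baseline * 100.0


def per_sequence_tok_s(total_tok_s: float, batch: int) -> float:
    """Per-sequence rate, ASSUMING total_tok_s is summed over the whole batch."""
    return total_tok_s / batch


if __name__ == "__main__":
    for batch, r in RESULTS.items():
        gap = relative_gap_percent(r["simple-llm"], r["vllm"])
        each = per_sequence_tok_s(r["simple-llm"], batch)
        print(f"batch={batch}: {gap:+.1f}% vs vllm, ~{each:.1f} tok/s per sequence")
```

On these figures simple-llm is about 2% behind vllm at batch=1 and about 5% ahead at batch=64.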

Related Videos

"Why are you benchmarking DGX Spark? It's a training box." Yeah. Low bandwidth, but 128GB of unified memory is just sitting there. Plenty of room to optimize. DGX Spark + Qwen3.6 27B. Four backend/quant combos: 🔴 llama.cpp + UD_Q4_K_XL > 11.0 tok/s (baseline), TTFT 297ms 🟢 llama.cpp + DFlash > 20.4 tok/s (peaks at 97 tok/s), TTFT 320ms 🟡 vLLM FP8 + MTP > 13.1 tok/s, TTFT 540ms 🟣 vLLM NVFP4 + MTP > 24.2 tok/s, TTFT 376ms NVFP4+MTP is the winner for me, rock stable around 24 tok/s, no wild swings. DFlash is the wildcard: massive peaks, but fluctuates a lot. FP8+MTP barely beats baseline, and it's FP8. Love my Spark.

stevibe

39,835 views • 1 month ago

4×5090 lost badly in my last Ollama test. But with vLLM + Nemotron 3 Super NVFP4, it came back swinging.
4×5090
- TTFT: 0.281s
- Prompt eval: 608 tok/s
- Generate eval: 110 tok/s
PRO 6000
- TTFT: 0.123s
- Prompt eval: 1390 tok/s
- Generate eval: 81 tok/s
So who wins?
- PRO 6000 is still faster to respond
- But 4×5090 now wins on raw generation speed

stevibe

16,974 views • 3 months ago

vLLM fast inference running on monte carlo synthetic data generation with:
> peak generation throughput of ~23k token/s & avg of 20k token/s
> ~200 reqs/s
> Qwen/Qwen2.5-0.5B-Instruct
> on 1x Grace Hopper 200 - 480 GB (~96GB HBM3)
vllm config:
--max-num-seqs 512
--chunked-prefill-enabled (for better throughput)
--dtype float16 (half precision for mem efficiency)
--gpu-memory-utilization 0.95 (for better KV Caching)

Archie Sengupta

26,130 views • 5 months ago
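
The config listed in the post above could be assembled into a launch command roughly as follows. This is a sketch under assumptions: the post does not say whether the server CLI or the offline API was used, so `vllm serve` is a guess, and `build_vllm_serve_command` is a hypothetical helper. The post writes the chunked-prefill flag as `--chunked-prefill-enabled`; as far as I know the vLLM CLI spells it `--enable-chunked-prefill`, so that spelling is used here. Check `vllm serve --help` for your installed version.

```python
import shlex
from typing import List


def build_vllm_serve_command(
    model: str,
    max_num_seqs: int = 512,
    dtype: str = "float16",
    gpu_memory_utilization: float = 0.95,
    chunked_prefill: bool = True,
) -> List[str]:
    """Build an argv list for `vllm serve` from the settings quoted in the post."""
    cmd = [
        "vllm", "serve", model,
        "--max-num-seqs", str(max_num_seqs),        # cap on concurrently scheduled sequences
        "--dtype", dtype,                           # half precision weights/activations
        "--gpu-memory-utilization", str(gpu_memory_utilization),  # fraction of GPU memory vLLM may use
    ]
    if chunked_prefill:
        # Assumed spelling; the post calls it --chunked-prefill-enabled.
        cmd.append("--enable-chunked-prefill")
    return cmd


if __name__ == "__main__":
    # Only prints the command; it does not start a server.
    print(shlex.join(build_vllm_serve_command("Qwen/Qwen2.5-0.5B-Instruct")))
```

Building the argument list separately from running it keeps the settings easy to inspect and to pass to `subprocess.run` later.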

Dozens of teams have asked my advice on running LLMs. How fast is DeepSeek V3 with vLLM on 8 GPUs? What's the max throughput of Qwen 2.5 Coder with SGLang on one H100? Running & sharing benchmarks ad hoc was too slow. So we built a tiny app, the LLM Engine Advisor.

Charles 🎉 Frye

86,360 views • 1 year ago

The new gpt-oss-120b from OpenAI is now the best general purpose model to run on a Framework Desktop! ~40 tok/s on the MXFP4 version in LM Studio on Fedora 42.

Framework

129,524 views • 10 months ago

gpt-oss 120b running at 50 tokens/s locally on my macbook

ADAM

81,339 views • 10 months ago

Ran Google’s cookbook with 10 agents on my tiny GB10 GPU. 436 tok/s / 43.6 per agent Qwen3.6-35B + Dflash + DDTree on vLLM GB10 @ 74W The future isn't 10,000 GPUs in a nuclear-powered data center. It’s 10 agents on your desk solving your problems while you make your coffee.

Mitko Vasilev

145,938 views • 1 month ago

A full year of vLLM in 30 minutes by vLLM Lead from UC Berkeley, Simon Mo. Model and hardware usage trends, model architectures, API evolution, V1 engine rebuild, multimodal progress, expanding hardware support, and more. Plus how we are thinking about 2026. Enjoy!

Red Hat AI

15,697 views • 5 months ago

Today we announced a $150M seed round in Inferact, a new startup led by the maintainers of the vLLM project. Inferact will support the vLLM open source project through dedicated financial and developer resources and build what they see as the next generation commercial inference engine.
Cofounders Simon Mo and Woosuk Kwon joined a16z GP Matt Bornstein for a conversation on how vLLM came to be, what they’ve learned from building it, and what they’re now doing at Inferact. They cover how vLLM began as a side project, why open source is essential to the world’s AI infrastructure, why inference is getting harder, and more.
00:00 Introduction
11:41 Community and collaboration in vLLM
19:19 Understanding inference engines
24:27 Cluster scale and GPU deployment
31:19 Belief in open source AI
35:45 Founding Inferact
40:00 The future of Inference at scale
Simon Mo, Woosuk Kwon, Matt Bornstein

a16z

87,993 views • 4 months ago

vllm-studio > claude-desktop officially confirmed

0xSero

23,270 views • 1 month ago

InferenceMAX, vLLM TPU, compressed-tensors, MoE support via transformers, DeepSeek-OCR, and more. Here’s what’s new in the vLLM community over the past two weeks:

Red Hat AI

24,429 views • 7 months ago

Someone built a free and better alternative to Claude that runs 100% locally.
→ works with any LLM (Claude, GPT, Gemini, vLLM)
→ beats it on deep research
→ has Cowork-like capabilities
→ 50+ connectors out of the box
→ deploy in literally one command
100% open source. MIT license. 28k stars.

How To Prompt

39,043 views • 1 month ago

DeepSeek's latest (MLA, Multi-Token Prediction, 256 Experts, FP8 block quantization) shines with vLLM. Catch the office hours session where we discuss all the DeepSeek goodies and explore their integration and benchmarks with #vLLM.

Red Hat AI

14,093 views • 1 year ago

Qwen3.6 35B-A3B dropped yesterday, so I ran it on 4 GPUs to see how it performs:
🟣 RTX 3090: 49.78 tok/s, TTFT 852ms
🟡 RTX 4090: 118.93 tok/s, TTFT 686ms
🟢 RTX 5090: 160.37 tok/s, TTFT 409ms
🔵 DGX Spark: 59.98 tok/s, TTFT 228ms
I went with ollama as the backend because honestly, it's the easiest way for most people to get started. One command, model pulled, done.
I used Q4_K_M (24GB) across all four cards. The reason is the 3090 and 4090 don't support NVFP4 (only the 5090 and DGX Spark could use it). Keeping the same quant everywhere felt like the fairest way to compare.
And yes, you can absolutely squeeze more performance out of every card with vLLM, SGLang, or TensorRT-LLM. But that's not what this test is about. This is just the out-of-the-box experience for folks who own a GPU and want to try the new model tonight.

stevibe

390,280 views • 2 months ago

you can create complex agentic environments and launch RL training runs with a single prompt. deploy trained inference endpoints with a single click. no GPUs, no SSH, no vLLM. just `prime`. guide:

will brown

139,377 views • 3 months ago

We co-led Inferact's $150M seed round to support them in their mission to build the inference engine for all current and future AI.
In this episode of The Investment Memo, Lightspeed's Bucky Moore and James Alcorn sit down with Simon Mo (Co-Founder & CEO Inferact) to cover:
- How vLLM grew to 60K+ GitHub stars
- Why inference is shifting to the majority of compute
- How vLLM evolved from a research project into the industry standard
- Why building a company was the next step to push open-source inference forward
00:00 Introduction
02:03 The investment memo
04:47 Latency vs throughput vs cost
06:19 Paged attention explained
08:04 The evolution of attention
09:42 Growing the vLLM open source community
11:41 Working with hardware vendors
14:45 Deploying vLLM at large scale
16:03 Inferact's culture of openness
18:45 Building an open ecosystem and horizontal stack
19:45 Inferact's approach to fundraising
22:14 What is the future of inference?
Simon Mo, Bucky Moore, James Alcorn

Lightspeed

27,040 views • 4 months ago

Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!

Ronald Mannak

48,116 views • 2 months ago
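
The arithmetic behind the continuous-batching figures in the post above can be checked in a few lines. This is a minimal sketch that uses only the numbers quoted in the post; the function names are hypothetical.

```python
def aggregate_tok_s(per_stream_tok_s: float, streams: int) -> float:
    """Total throughput when every stream runs at the same per-stream rate."""
    return per_stream_tok_s * streams


def gain_percent(batched_total: float, single_stream: float) -> float:
    """Throughput gain of the batched run over the single-stream run, in percent."""
    return (batched_total / single_stream - 1.0) * 100.0


def per_stream_slowdown_percent(per_stream_batched: float, single_stream: float) -> float:
    """How much slower each individual stream gets once it shares the device."""
    return (1.0 - per_stream_batched / single_stream) * 100.0


if __name__ == "__main__":
    single = 43.0      # tok/s on 1 stream (from the post)
    per_stream = 26.0  # tok/s per stream on 2 concurrent streams (from the post)
    total = aggregate_tok_s(per_stream, 2)
    print(f"total: {total:.0f} tok/s")
    print(f"gain: {gain_percent(total, single):.1f}%")
    print(f"per-stream slowdown: {per_stream_slowdown_percent(per_stream, single):.1f}%")
```

The aggregate rises by roughly 21% while each individual stream runs about 40% slower, which is the usual trade-off of batching: more total tokens per second, at some cost in per-request speed.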

Gemma4 12B with Unsloth's Quant on DGX Spark
Quants:
- UD_Q4_K_XL
- UD_Q5_K_XL
- UD_Q6_K_XL
- UD_Q8_K_XL
Summary:
- Q4: 25.21 tok/s, TTFT 168ms
- Q5: 21.7 tok/s, TTFT 182ms
- Q6: 17.68 tok/s, TTFT 193.95ms
- Q8: 15.22 tok/s, TTFT 221ms

stevibe

18,459 views • 12 days ago

DeepSeek-V4-Flash-Spark and Spark-Mini were born today. 1 command setup to Pi / vllm-studio / docker deployment. All this is running local on a single DGX Spark.

0xSero

20,019 views • 19 days ago

We’re thrilled to open-source TriAttention! 🚀
🦞 Deploy OpenClaw (32B LLM) on a single 24GB RTX 4090 locally
💻 Full code open-source & vLLM-ready for one-click deployment
⚡️ 2.5× faster inference speed & 10.7× less KV cache memory usage
TriAttention is a novel KV cache compression method built on rigorous trigonometric analysis in the Pre‑RoPE space for efficient LLM long reasoning.
Github Repo: Paper Link: Homepage:

Yukang Chen

197,064 views • 2 months ago