First steps for a specialized DeepSeek v4 Flash inference engine focused on inference quality and stability at different quantizations, with a networked API that supports batching. This is the 2-bit quantized model running on my M3 Max 128GB.

antirez

68,217 subscribers

14,176 views • 1 month ago • via X (Twitter)

Related Videos

DeepSeek DeepSeek is now live on Heurist. Open source AI just reached new heights. Access DeepSeek-R1 and DeepSeek-V3 for smarter inference at a fraction of cost. Try it via Heurist API or explore on Pondera.

Heurist

25,657 views • 1 year ago

AGI at home Running DeepSeek R1 across my 7 M4 Pro Mac Minis and 1 M4 Max MacBook Pro. Total unified memory = 496GB. Uses EXO Labs distributed inference with 4-bit quantization. Next goal is fp8 (requires >700GB)

Alex Cheema

1,934,652 views • 1 year ago

Running DeepSeek R1 on my desk Uses EXO Labs with Thunderbolt 5 interconnect (80Gbps) to run the full (671B, 8-bit) DeepSeek R1 distributed across 2 M3 Ultra 512GB Mac Studios (1TB total Unified Memory). Runs at 11 tok/sec. Theoretical max is ~20 tok/sec.

Alex Cheema

992,119 views • 1 year ago

Nemotron 3 Ultra is fast and genuinely good Compared it with 3 frontier models: DeepSeek V4, MiniMax M3, and Qwen 3.7 Max on 2 prompts very impressive results

GMI Cloud

224,474 views • 16 days ago

Batching for vision models is now available in Beta with our latest MLX engine update 👾 The updated engine also brings major improvements to caching for faster inference overall. Turn on Developer Mode, choose the beta runtime channel, and select LM Studio MLX v1.8.1.

LM Studio

46,015 views • 1 month ago

Watching llama.cpp do 40 tok/s inference of the 7B model on my M2 Max, with 0% CPU usage, and using all 38 GPU cores. Congratulations Georgi Gerganov ! This is a triumph.

Nat Friedman

1,764,052 views • 3 years ago

How much faster is the new MacBook Pro for AI inference? M4 Max is 27% faster with 72 tok/sec compared to 56 tok/sec of the M3 Max with MLX running Gemma 2 9B (4bit). The 27% speedup is the same with Llama-3.2-1b, Llama-3.2-3b and others. Next up: EXO Labs M4 cluster.

Alex Cheema - e/acc

527,894 views • 1 year ago

Happy Friday! We just put DeepSeek-V4-Pro up on It’s the world’s largest open source model at 1.6T parameters, and you can run it for free running on NVIDIA Blackwell GPUs. Try the NVIDIA NIM API →

NVIDIA AI

202,087 views • 1 month ago

Laika AI x Inference Labs Excited to announce our partnership with Inference Labs We're providing our real-time RAG & AI model API to Inference Labs, powering their verification infrastructure with live blockchain data. Inference Labs delivers open-source, trustless verification for AI agent outputs, so you can trust what you see—without relying on centralized gatekeepers.

Laika AI

13,727 views • 1 year ago

Another demo of the iPhone 17 Pro’s on-device LLM performance This time with Ling mini 2.0 by InclusionAI, a 16B MoE model with 1.4B active parameters running at ~120tk/s Thanks to Awni Hannun for the MLX DWQ 2-bit quants

Adrien Grondin

46,205 views • 9 months ago

Introducing a 100% free coding agent with DeepSeek v4 Pro Choose any model, all free: - DeepSeek v4 Pro/Flash - Kimi K2.6 - MiniMax M2.7 npm i -g freebuff

James Grugett

403,143 views • 1 month ago

Got continuous batching working with SSMs in mlx-lm. Here's four OpenCode agents simultaneously running Nvidia's Nemotron Nano on 64GB M4 Max. This is a nice model for smaller machines since it's MoE + hybrid attention (small cache).

Awni Hannun

35,078 views • 5 months ago

The world’s fastest inference for Llama 4 Scout is now on Poe! At over 2,600 tokens per second, this bot allows for near-instant interactions. (1/2)

Poe

17,286 views • 1 year ago

Real-time Moondream inference using our new inference engine

vik

144,409 views • 2 months ago

Want to run Deepseek R1 ? Text-generation-inference v3.1.0 is out and supports it out of the box. Both on AMD and Nvidia !

Nicolas Patry

28,859 views • 1 year ago

Today, we're excited to announce a partnership with Manta Network (🔱,🔱), the Modular L2 solution that is transforming the landscape of ZK. Nesa is bringing private AI inference to the Manta ecosystem through a specialized collaboration. For the first time, developers on Manta can access Nesa's full library of AI models on-chain, and enjoy lightning fast, end-to-end private AI inference without ever leaving the Manta ecosystem. This means that dapps and protocols can now fuse with AI via smart contract on Manta. This integration is set to redefine the future of decentralized technology with AI. Stay tuned for more updates on how Nesa and Manta Network will be shaping the future of crypto together.

Nesa

35,180 views • 1 year ago

My AI broke the world record on Tempest yesterday! But I still hold the human record :-) [on Extreme difficulty settings] Here's a little demo reel of the Tempest AI doing inference and training at the same time up on the hardest Tempest levels. This is all running on our Dell Technologies 7875 Workstation, with the 9995WX CPU handling 2000 fps of Tempest while the dual Blackwell RTX6000 GPUs do inference and training.

Dave W Plummer

37,372 views • 3 months ago

Step 4 to achieve truly serverless GPUs for AI inference: skip over unserializable inference engine setup steps like CUDA graph capture and Torch compilation by stacking GPU snapshots and CPU snapshots.

Charles 🎉 Frye

17,384 views • 1 month ago

Deepseek V4 Flash is now free via Nous Portal for a limited time thanks to Novita AI!

Nous Research

517,489 views • 1 month ago

NVIDIA Nemotron 3 Nano Omni, a new multimodal reasoning model, is now live on Jetson AI Lab and unifies vision, audio, and language into a single reasoning loop. 🙌 Power your NemoClaws by running this model with Ollama, vLLM and other inference frameworks on NVIDIA Jetson hardware. Try it ➡️

NVIDIA Robotics

15,828 views • 1 month ago