The next mlx-vlm release will ship with continuous batching support on the server 🚀

What's coming:
→ Continuous batching: new requests join the active batch immediately, with no waiting. Mixed image + text batches are supported.
→ OpenAI-compatible API: field-for-field match with mlx-lm, reasoning/content split for thinking models, tag-aware streaming...
→ Multi-turn tool calling: full tool use support across streaming and non-streaming, works with Gemma4 and other templates.
→ Vision feature caching: cache image embeddings across turns. Gemma4: 228x speedup, Qwen3.5: 23x on cache hit.

All running locally on Apple Silicon.

Check out this demo running 4 concurrent requests (mixed image + text) to gemma-4-26B-A4B-IT by Google Gemma in bf16 using Pi + MLX-VLM server on my M3 Ultra. One of the requests ingests an 8K resolution image!

Prince Canuma

21,897 subscribers

82,085 views • 2 months ago •via X (Twitter)
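
For readers new to the idea, here is a toy sketch of what "new requests join the active batch immediately" could look like. It is a plain-Python simulation and not mlx-vlm's scheduler: the `Request` class, the `run_continuous_batching` function, and the "one token per request per step" model are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    # Hypothetical request record, invented for this sketch.
    name: str
    arrival_step: int       # decode step at which the request reaches the server
    tokens_needed: int      # number of tokens the request wants generated
    generated: int = 0
    finished_at: Optional[int] = None


def run_continuous_batching(requests, max_batch_size=4):
    """Simulate a decode loop in which waiting requests join between steps."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    active = []
    step = 0
    while pending or active:
        # Admit every request that has already arrived while the batch has room.
        while (pending and pending[0].arrival_step <= step
               and len(active) < max_batch_size):
            active.append(pending.popleft())
        # One decode step yields one token for each active request.
        for req in active:
            req.generated += 1
        step += 1
        # Retire finished requests so their slots free up before the next step.
        for req in active:
            if req.generated == req.tokens_needed:
                req.finished_at = step
        active = [r for r in active if r.finished_at is None]
    return {r.name: r.finished_at for r in requests}


reqs = [Request("A", 0, 5), Request("B", 2, 2), Request("C", 0, 3)]
print(run_continuous_batching(reqs))  # {'A': 5, 'B': 4, 'C': 3}
```

In this run, request B arrives at step 2 while A and C are still decoding, joins the batch right away, and finishes at step 4. With `max_batch_size=1` the same requests are served one after another and B does not finish until step 10.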

Related Videos

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,095 views • 2 months ago

A perfect coding model for MLX on Apple silicon.. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Awni Hannun

186,641 views • 11 months ago

Hermes Agent Nous Research + Gemma 4 26B Google DeepMind, fully working on local Mac 🤯: one prompt → full CLI app + tests + pytest passing, tool calls firing back to back. Local Claude Code vibes. Gemma 4 MoE is fast AND smart, and tool calling just works, powered by Rapid-MLX which natively parses Gemma 4's tool format (no other local backend can do this yet). Two lines to try it:
pip install rapid-mlx && rapid-mlx serve gemma-4-26b
pip install hermes-agent && hermes
@teknaborat @siaborisov #HermesAgent #Gemma4 #LocalLLM

raullen

24,988 views • 2 months ago

RF-DETR by Roboflow now on MLX It can do realtime instance segmentation on-device and enable some cool use cases for visual analysis, monitoring and robotics like Reachy Mini. Also augmented VLM and VLA by preprocessing image and video with areas of interest. New release coming soon on mlx-vlm 🚀 For those who can’t wait you can install mlx-vlm from source.

Prince Canuma

30,499 views • 2 months ago

Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!

Ronald Mannak

48,116 views • 2 months ago

Kimi K2.6 was released 1h ago, and it looks amazing! Here it's running with MLX (mlx-vlm) on two M3 Ultras (full 1T param VLM) 🔥

Pedro Cuenca

65,682 views • 2 months ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

Red Hat AI

15,902 views • 16 days ago

As part of our goal to make MLX a great research tool, we're expanding support to new languages like Swift and C, making experimentation on Apple silicon easier for ML researchers. Video generating text with Mistral 7B and MLX Swift 👇 MLX is an array framework for machine learning research on Apple silicon. MLX is intended for research and not for production deployment of models in apps.

Awni Hannun

88,741 views • 2 years ago

Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + LM Studio to review code and find bugs using Qwen 3.6 Powered by the updated MLX engine with batching in beta in the app The batching speed boost is noticeable

Adrien Grondin

74,729 views • 1 month ago

Hot tip for anyone doing AI dev: Use Ollama to easily run models like Deepseek-r1 or Gemma locally on your machine. It downloads them and spins up a server with an OpenAI SDK compatible API The smaller models are fast and good enough to work on new features or debug streaming without having to pay for API requests

Wes Bos

153,435 views • 11 months ago

Qwen3.5-397B-A17B with Vision running on a Single M3 Ultra 🚀 Shout out to the Qwen team because they cooked with this model! Get started today: > uv pip install -U mlx-vlm Model collection 👇🏽 Repo:

Prince Canuma

11,386 views • 4 months ago

🔴 The Pain: Running local MLX models is incredibly fast and private. But let's be real - testing tool calling via terminal is clunky, and there's zero good UI for it. 🟢 The Fix: rapid-mlx share Just ONE command gives you a polished web chat + seamless tool calling (works beautifully with gemma-4-12b-qat). We are proud to be the ONLY MLX inference engine in the community shipping this. ⚡️ 👇 Try it now: brew install raullenchai/rapid-mlx/rapid-mlx

raullen

22,680 views • 13 days ago

I'm exploring running LLMs locally on iPhones and Macs. I’ve got Llama 3.2 running locally using Apple’s MLX and with support for tool calling. This example runs two LLMs: one to identify tools to call based on the query and one to generate responses based on the tools’ outputs.

Simon B. Støvring

47,972 views • 1 year ago

The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra. More here: * Train a Transformer LM or fine-tune with LoRA * Text generation with Mistral * Image generation with Stable Diffusion * Speech recognition with Whisper

Awni Hannun

66,547 views • 2 years ago

Your Mac is about to run inference like a datacenter. Coming soon to MLX-Swift: Continuous batching: the fastest way to handle multiple inference streams locally. It starts with regular inference and seamlessly upgrades to batched mode when new requests arrive. The best of both worlds. Based on the work of and Awni Hannun

Ronald Mannak

60,570 views • 7 months ago

🔥 Thrilled to have worked with Google AI Developers on day-0 MLX support for Gemma 3 QAT! QAT optimizes models during training by simulating low-precision operations, delivering similar performance to FP16 and dramatic memory savings when quantised: • Gemma 3 27B: 54GB → 14.1GB (74% reduction) • Gemma 3 12B: 24GB → 6.6GB (72% reduction) • Gemma 3 4B: 8GB → 2.6GB (67% reduction) • Gemma 3 1B: 2GB → 0.5GB (75% reduction) Get started: > pip install mlx-vlm mlx-lm Model collection 👇🏽

Prince Canuma

11,374 views • 1 year ago

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding a ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

24,155 views • 7 days ago

Run the latest Ministral 3 models by Mistral AI on macOS. Running privately on-device and optimized for Apple Silicon with MLX.

Locally AI - Local AI Chat

34,901 views • 5 months ago

The latest MLX has a CUDA back-end! To get started: pip install "mlx[cuda]" With the same codebase you can develop locally, run your model on Apple silicon, or in the cloud on Nvidia GPUs. MLX is designed around Apple silicon - which has a unified memory architecture. It uses the same design with CUDA. So there's no need to move arrays around from CPU memory to GPU memory. Note, this is early days - some operations are missing and performance is still being optimized. But it's already quite fast for Transformer training, text generation, and more! Here's a demo using mlx-lm to generate text with Llama 3 8B (bf16) on an A100:

Awni Hannun

42,761 views • 10 months ago

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Awni Hannun

95,050 views • 5 months ago