Next mlx-vlm release will ship with continuous batching support on the server 🚀
What's coming:
→ Continuous batching: new requests join the active batch immediately, no waiting. Mixed image + text batches supported.
→ OpenAI-compatible API: field-for-field match with mlx-lm, reasoning/content split for thinking models, tag-aware streaming...
→ Multi-turn tool calling: full tool use support across streaming and non-streaming, works with Gemma4 and other templates.
→ Vision feature caching: cache image embeddings across turns. Gemma4: 228x speedup, Qwen3.5: 23x on cache hit.
All running locally on Apple Silicon.
Check out this demo running 4 concurrent requests (mixed image + text) to gemma-4-26B-A4B-IT by Google Gemma in bf16 using Pi + MLX-VLM server on my M3 Ultra. One of the requests ingests an 8K resolution image!
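The continuous batching idea mentioned in the post (new requests join the active batch immediately instead of waiting for the current batch to drain) can be illustrated with a toy scheduler. This is only a sketch under stated assumptions: it is not the mlx-vlm server's implementation, and `Request`, `run_continuous_batching`, and `max_batch_size` are hypothetical names invented for the illustration. Each loop iteration stands in for one decode step that produces one token per active request.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """Hypothetical request record used only for this illustration."""
    name: str
    arrival_step: int            # decode step at which the request reaches the server
    tokens_needed: int           # number of decode steps the request requires
    generated: int = 0
    finished_step: Optional[int] = None


def run_continuous_batching(requests, max_batch_size=4):
    """Simulate decode steps; return the step at which each request finishes."""
    pending = deque(sorted(requests, key=lambda r: r.arrival_step))
    waiting = deque()
    active = []
    step = 0
    while pending or waiting or active:
        # Requests that have arrived by this step become eligible.
        while pending and pending[0].arrival_step <= step:
            waiting.append(pending.popleft())
        # Key idea: admit eligible requests as soon as a slot is free,
        # without waiting for the rest of the batch to finish.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        # One decode step: every active request produces one token.
        for r in active:
            r.generated += 1
            if r.generated == r.tokens_needed:
                r.finished_step = step
        # Finished requests leave the batch and free their slots.
        active = [r for r in active if r.finished_step is None]
        step += 1
    return {r.name: r.finished_step for r in requests}


demo = [Request("A", 0, 5), Request("B", 2, 2), Request("C", 3, 1)]
print(run_continuous_batching(demo))  # {'A': 4, 'B': 3, 'C': 3}
```

In this toy run the short requests B and C finish before the long request A, because they join A's batch mid-flight rather than queueing behind it.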

Prince Canuma

21,897 subscribers

82,085 views • 2 months ago • via X (Twitter)

Related videos

A long time coming but new mlx-lm is here with better batching support in the server and Gemma 4. pip install -U mlx-lm Here is a video where a single M3 Ultra serves 5 opencode sessions with Gemma 4 26B that process ~130k tokens in ~1.5 minutes.

Angelos Katharopoulos

66,089 views • 2 months ago

A perfect coding model for MLX on Apple silicon. Qwen delivered again. Runs quite fast on an M3 Ultra. Running the 4-bit quantized with mlx-lm:

Awni Hannun

186,641 views • 10 months ago

Hermes Agent (Nous Research) + Gemma 4 26B (Google DeepMind), fully working on a local Mac 🤯: one prompt → full CLI app + tests + pytest passing, tool calls firing back to back. Local Claude Code vibes.
Gemma 4 MoE is fast AND smart, and tool calling just works, powered by Rapid-MLX, which natively parses Gemma 4's tool format (no other local backend can do this yet).
Two lines to try it:
pip install rapid-mlx && rapid-mlx serve gemma-4-26b
pip install hermes-agent && hermes
@teknaborat @siaborisov #HermesAgent #Gemma4 #LocalLLM

raullen

24,988 views • 2 months ago

RF-DETR by Roboflow is now on MLX. It can do realtime instance segmentation on-device and enable some cool use cases for visual analysis, monitoring and robotics like Reachy Mini. It can also augment VLMs and VLAs by preprocessing images and video with areas of interest. New release coming soon on mlx-vlm 🚀 For those who can’t wait, you can install mlx-vlm from source.

Prince Canuma

30,499 views • 2 months ago

Apple Silicon + Gemma 4 fans: this is for you. Pico AI Server now supports continuous batching with MLX-Swift. 43 tok/s on 1 stream. 26 tok/s per stream on 2 concurrent streams. That’s 52 tok/s total, a 21% throughput gain on a six-year-old MacBook Pro M1 Max!
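The throughput figures in this post can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function names are hypothetical and chosen for the illustration.

```python
def aggregate_throughput(per_stream_tps: float, num_streams: int) -> float:
    """Total tokens per second across all concurrent streams."""
    return per_stream_tps * num_streams


def relative_gain(batched_total_tps: float, single_stream_tps: float) -> float:
    """Fractional throughput gain of the batched run over a single stream."""
    return batched_total_tps / single_stream_tps - 1.0


single = 43.0                           # tok/s with 1 stream (from the post)
total = aggregate_throughput(26.0, 2)   # 26 tok/s per stream on 2 streams
gain = relative_gain(total, single)

print(f"{total:.0f} tok/s total, {gain:.0%} gain")  # 52 tok/s total, 21% gain
```

Note that each individual stream is slower than the single-stream case (26 vs. 43 tok/s); the gain is in aggregate throughput.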

Ronald Mannak

48,116 views • 2 months ago

Kimi K2.6 was released 1h ago, and it looks amazing! Here it's running with MLX (mlx-vlm) on two M3 Ultras (full 1T param VLM) 🔥

Pedro Cuenca

65,682 views • 1 month ago

Gemma 4 12B dropped today. Apache 2.0, multimodal: text, image, audio, and video. 256K context, built-in thinking, native tool calling. Running on Red Hat OpenShift AI with vLLM on Day 0:

Red Hat AI

15,864 views • 12 days ago

As part of our goal to make MLX a great research tool, we're expanding support to new languages like Swift and C, making experimentation on Apple silicon easier for ML researchers. Video generating text with Mistral 7B and MLX Swift 👇 MLX is an array framework for machine learning research on Apple silicon. MLX is intended for research and not for production deployment of models in apps.

Awni Hannun

88,726 views • 2 years ago

Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + LM Studio to review code and find bugs using Qwen 3.6. Powered by the updated MLX engine with batching in beta in the app. The batching speed boost is noticeable.

Adrien Grondin

72,719 views • 26 days ago

Hot tip for anyone doing AI dev: Use Ollama to easily run models like Deepseek-r1 or Gemma locally on your machine. It downloads them and spins up a server with an OpenAI SDK compatible API. The smaller models are fast and good enough to work on new features or debug streaming without having to pay for API requests.

Wes Bos

153,435 views • 11 months ago

Qwen3.5-397B-A17B with Vision running on a Single M3 Ultra 🚀 Shout out to the Qwen team because they cooked with this model! Get started today: > uv pip install -U mlx-vlm Model collection 👇🏽 Repo:

Prince Canuma

11,386 views • 3 months ago

I'm exploring running LLMs locally on iPhones and Macs. I’ve got Llama 3.2 running locally using Apple’s MLX and with support for tool calling. This example runs two LLMs: one to identify tools to call based on the query and one to generate responses based on the tools’ outputs.

Simon B. Støvring

47,972 views • 1 year ago

The video is a Llama v1 7B model implemented in MLX and running on an M2 Ultra. More here:
* Train a Transformer LM or fine-tune with LoRA
* Text generation with Mistral
* Image generation with Stable Diffusion
* Speech recognition with Whisper

Awni Hannun

66,547 views • 2 years ago

Your Mac is about to run inference like a datacenter. Coming soon to MLX-Swift: Continuous batching: the fastest way to handle multiple inference streams locally. It starts with regular inference and seamlessly upgrades to batched mode when new requests arrive. The best of both worlds. Based on the work of and Awni Hannun

Ronald Mannak

60,570 views • 7 months ago

🔥 Thrilled to have worked with Google AI Developers on day-0 MLX support for Gemma 3 QAT! QAT optimizes models during training by simulating low-precision operations, delivering similar performance to FP16 and dramatic memory savings when quantised:
• Gemma 3 27B: 54GB → 14.1GB (74% reduction)
• Gemma 3 12B: 24GB → 6.6GB (72% reduction)
• Gemma 3 4B: 8GB → 2.6GB (67% reduction)
• Gemma 3 1B: 2GB → 0.5GB (75% reduction)
Get started:
> pip install mlx-vlm mlx-lm
Model collection 👇🏽
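The reduction percentages in this post follow from the before/after sizes. A minimal sketch of the calculation, using only the sizes quoted above (the helper name `memory_reduction_pct` is hypothetical); since the quoted sizes are themselves rounded, the computed values should be expected to agree with the posted percentages only to within about one point.

```python
def memory_reduction_pct(before_gb: float, after_gb: float) -> float:
    """Percentage of memory saved when going from before_gb to after_gb."""
    return (1.0 - after_gb / before_gb) * 100.0


# (model, size before quantization in GB, size after in GB), as quoted in the post
sizes = [
    ("Gemma 3 27B", 54.0, 14.1),
    ("Gemma 3 12B", 24.0, 6.6),
    ("Gemma 3 4B", 8.0, 2.6),
    ("Gemma 3 1B", 2.0, 0.5),
]

for model, before, after in sizes:
    print(f"{model}: {memory_reduction_pct(before, after):.1f}% reduction")
```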

Prince Canuma

11,374 views • 1 year ago

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding a ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

23,587 views • 3 days ago

The latest MLX has a CUDA back-end! To get started: pip install "mlx[cuda]" With the same codebase you can develop locally, run your model on Apple silicon, or in the cloud on Nvidia GPUs. MLX is designed around Apple silicon - which has a unified memory architecture. It uses the same design with CUDA. So there's no need to move arrays around from CPU memory to GPU memory. Note, this is early days - some operations are missing and performance is still being optimized. But it's already quite fast for Transformer training, text generation, and more! Here's a demo using mlx-lm to generate text with Llama 3 8B (bf16) on an A100:

Awni Hannun

42,751 views • 10 months ago

Run the latest Ministral 3 models by Mistral AI on macOS. Running privately on-device and optimized for Apple Silicon with MLX.

Locally AI - Local AI Chat

34,894 views • 5 months ago

Running four simultaneous OpenCode agents works well with mlx_lm.server continuous batching and MiniMax M2.1 on an M3 Ultra:

Awni Hannun

95,013 views • 5 months ago

oMLX is working really well as a single-machine inference engine for coding agents! Caching is managed perfectly (be aware, it can use a ton of disk space!) and oQ quantization delivers great results. Behind the scenes it uses the standard MLX building blocks (75% created by Prince Canuma 🙏):
- mlx-lm
- mlx-vlm
- mlx-embeddings
- mlx-audio
I tested Qwen3.6-35B-A3B-oQ6 on M5 Max with two pi instances, and it was fast and furious, leveraging the cache like crazy, as you can see in the video. Let me try to create some oQ versions (2, 4, 6?) of MiniMax M2.7 now, and then I'll move on to distributed inference. I must win! 💪

Ivan Fioravanti ᯅ

19,268 views • 1 month ago