Ivan Fioravanti ᯅ

@ivanfioravanti • 39,492 subscribers

GenAI/LLM addict, Apple MLX, Cloud computing, Kubernetes, Technology Advisor, Investor and Co-Founder & Board Member of CoreView.

Shorts

MLX MiniMax 2.5 running LOCALLY on a single M3 Ultra 512GB! Writing a poem on LLMs at 6bit quantization! 🔥 Let's start some coding, context and distributed tests! Generation: 40.2 tokens-per-sec, peak memory: 186 GB

226,126 views

It worked! Hermes Agent + Exo + Qwen3 Coder Next 8bit created an incredible snake game, with the model following 100% of the specifications passed in the prompt! Let's load something bigger now!

159,666 views

Wait, what? 😱 Apple Foundation Model on Private Cloud Compute is working with the fm CLI! Here's a video!

38,712 views

MLX news: MOSS-TTS-Local Transformer 1.5 is now available on mlx-audio! Thanks to Prince Canuma and Lucas Newman. Audio on! 🔉

20,292 views

First real test on M4 Max 40GPU, transcribing 3 hours of video with mlx_whisper 0.4.1: - M4 Max: 2:19 mins (max fan, no throttling) - M4 Max: 2:29 mins (system fan, throttling towards the end; GPU frequency drops from 1.578 to 952). Here's the video (8x) showing throttling kicking in at the end.

216,328 views

GLM-4.7 Flash by Z.ai running on M3 Ultra using MLX! 4bit at 81 toks/s 🔥 8bit at 64 toks/s 🔥 (video below) Model converted from transformers to MLX on the fly with Codex + GPT-5.2 High!

75,425 views

This is on M3 Ultra; M5 Ultra will be crazy!

98,227 views

Qwen 3 0.6B is Ultra powerful if fine-tuned! Here I got 83.5% on classification tasks! Only Gemini 2.5 Pro does better with 85%! Video is in real time on M3 Ultra!

143,129 views

Running Gemma 4 12B on your iPhone? Yes! 🧵 LM Studio + Locally AI (latest version) with LM Link is really cool! This opens up additional scenarios! My brain is on fire 🔥

19,292 views

Another incredible p5js animation created by the new King: Sonnet 3.7 Thinking! It's a kind of magic!

137,251 views

Eulerian Fluid simulation test! Zero-shot! Opus 4.6 vs GPT-5.3 vs Gemini 3 Deep Think! My personal preference: 🥇 Gemini 3 Deep Think (really strong!) 🥈 Opus 4.6 🥉 GPT 5.3 High

38,277 views

GLM-4.7 and MiniMax M2.1 side by side, both powered by Claude Code! Prompt below. Trick: one settings file per provider: - claude --settings ~/.claude/minimax_settings.json - claude --settings ~/.claude/zai-settings.json

44,866 views

Playing with Qwen3-TTS and MLX-Audio locally on Mac Studio M3 Ultra 🔥 Amazing model by Qwen and great work by Prince Canuma bringing this magic to MLX! Tuning the voice with instructions feels like magic! The command and prompts to run it are in the video.

35,720 views

Qwen-Image-Lightning on Apple Silicon! M3 Ultra: 9 mins --> 42 seconds! 🔥 M4 Max: 11 mins --> 1:27 mins! 🤩 1st try: MPS + float32, 11 mins. 2nd try: MPS + bfloat16, 9 mins. 3rd try: MPS + bfloat16 + Qwen-Image-Lightning LoRA, 42 seconds!!! 🤯 Here's a video (8x).

51,001 views
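For a sense of scale, the Qwen-Image-Lightning timings above can be turned into speedup factors. This is back-of-the-envelope arithmetic added for illustration, not part of the original post; `to_seconds` is a hypothetical helper, and "9 mins" and "11 mins" are assumed to mean exactly 9:00 and 11:00.

```python
def to_seconds(minutes: int, seconds: int = 0) -> int:
    """Convert an mm:ss duration into whole seconds."""
    return minutes * 60 + seconds

# (before, after) render times quoted in the post, in seconds.
# Assumption: "9 mins" and "11 mins" are treated as exactly 9:00 and 11:00.
runs = {
    "M3 Ultra": (to_seconds(9), 42),
    "M4 Max": (to_seconds(11), to_seconds(1, 27)),
}

for chip, (before, after) in runs.items():
    # Speedup is simply the ratio of the old time to the new time.
    print(f"{chip}: {before}s -> {after}s, about {before / after:.1f}x faster")
```

Under those assumptions this comes out to roughly 12.9x on M3 Ultra and 7.6x on M4 Max.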

I'm finally entering the Google Gemini world too! I subscribed to Ultra so I could test Deep Think! But I hit a wall immediately 😢

27,608 views

Qwen QwQ 32B fp16 on M4 Max and M2 Ultra powered by MLX! M2 Ultra: 10.2 toks/s, M4 Max: 7.6 toks/s! Prompt: "Create an amazing animation using p5js". An o1-mini level local model! Note: use temp 0.7-0.75 for optimal results in coding. I did some tests and this

62,377 views

DeepSeek R1 Qwen 7B 4bit, M2 Ultra vs M4 Max on Apple MLX 🤫 Let them think... (video 4x in center part) M2 Ultra: 114.9 tokens per sec, M4 Max (14"): 88.3 tokens per sec

60,200 views

🔥The first Apple MLX 6bit model is on Hugging Face!🔥 Qwen2.5-Coder-32B-Instruct-6bit! 3bit conversion and test in progress! Video (8x) below on M4 Max 40GPU: - Prompt: 38 tokens, 61.731 tokens-per-sec - Generation: 1181 tokens, 16.939 tokens-per-sec - Peak memory: 25.122 GB

46,494 views
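Assuming each tokens-per-sec figure in the Qwen2.5-Coder 6bit post is an average over its phase, the wall-clock time of each phase is just tokens divided by rate. A minimal sketch of that arithmetic, added for illustration (the helper name `phase_seconds` is hypothetical):

```python
def phase_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Approximate duration of a phase, assuming a constant average rate."""
    return tokens / tokens_per_sec

# Figures from the Qwen2.5-Coder-32B-Instruct-6bit run on M4 Max 40GPU.
prompt_s = phase_seconds(38, 61.731)
gen_s = phase_seconds(1181, 16.939)

print(f"prompt {prompt_s:.2f}s, generation {gen_s:.1f}s, total {prompt_s + gen_s:.1f}s")
```

That works out to well under a second of prompt processing and roughly 70 seconds of generation, so decoding dominates this particular run.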

Videos

MLX GLM 5.2 distributed on two M3 Ultra 512GB 🔥 One M3 Ultra: 18.8 tokens/sec. Two M3 Ultras: 23.4 tokens/sec. Context: - The PR by Pedro Cuenca is still open and there is probably room for improvement - This is a basic generation test to measure decoding performance; I will do full context benchmarking once the PR is more mature - nvfp4 quantization used - The video alternates standard speed and 20x, with one Mac first and distributed later. Enjoy! 🙌🏻

Ivan Fioravanti ᯅ

87,805 views • 29 days ago
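One rough way to read the one-node vs two-node GLM 5.2 numbers is speedup (two-node throughput divided by one-node throughput) and parallel efficiency (speedup divided by node count). This is a back-of-the-envelope sketch added for illustration, not part of the original benchmark; `scaling` is a hypothetical helper, and for a WIP distributed implementation these figures are only a coarse indicator.

```python
def scaling(single: float, multi: float, nodes: int) -> tuple[float, float]:
    """Return (speedup, parallel efficiency) for throughput on 1 vs `nodes` machines."""
    speedup = multi / single
    # Perfect scaling would give efficiency 1.0 (speedup equal to node count).
    efficiency = speedup / nodes
    return speedup, efficiency

# Decode throughput quoted in the post (tokens/sec).
speedup, efficiency = scaling(single=18.8, multi=23.4, nodes=2)
print(f"speedup {speedup:.2f}x, parallel efficiency {efficiency:.0%}")
```

For 18.8 vs 23.4 tokens/sec this gives about a 1.24x speedup, or roughly 62% efficiency across the two Macs.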

GLM-4.7-8bit (350GB) running at 19 toks/s on two M3 Ultra 512GB using Tensor Parallelism with EXO - MLX, versus 14 toks/s on a single node. 🚀 Next: context benchmarking & then OpenCode tests 🔥 Note: this is built from source; I had to change a few things to run it.

Ivan Fioravanti ᯅ

327,687 views • 6 months ago

GLM-5.2 8bit running on two M3 Ultra 512GB with MLX distributed? Here it is! 🚀 Decode speed: 17.9 tokens/sec 🔥 Memory used: ~760GB 👀 Again, keep in mind it's a preliminary PR by super Pedro Cuenca and still a WIP!

Ivan Fioravanti ᯅ

34,140 views • 28 days ago

Local AI to the max! Hermes Agent + Computer Use + Reachy Mini + llamacpp + Gemma E4B QAT (multimodal) + speech-to-speech = FUN! Video extracted and produced by super Stefano from my last live on X.

Ivan Fioravanti ᯅ

17,107 views • 15 days ago

Video of LM Studio MLX v1.8.5 + Codex + qwen3.6-35b-a3b 6bit on an M5 Max 128GB. The video is at normal speed to give you an idea of the experience. Command used: codex --oss -m qwen/qwen3.6-35b-a3b Local AI is becoming powerful 💪

Ivan Fioravanti ᯅ

40,021 views • 1 month ago

Local AI in action! MiniMax M3 running locally on a single M3 Ultra 512GB in Unsloth Studio! 🔥 Here, UD-Q5_K_XL decoding at 32.5 toks/s!

Ivan Fioravanti ᯅ

31,565 views • 1 month ago

What is this Nanbeige4.1-3B model?! Running at - 77 toks/s in bf16 (in video) - 115 toks/s in 8bit on M3 Ultra with MLX, with these benchmark scores! 🔥

Ivan Fioravanti ᯅ

85,228 views • 5 months ago

MiniMax M3 support added to mlx-vlm with MSA implementation! 🚀 Tested on M3 Ultra 512GB, running at 24 tps with peak memory ~240GB. Now working on optimizing performance and adding a ton of tests 💪 Model is here: PR is here:

Ivan Fioravanti ᯅ

24,376 views • 1 month ago

MLX + OpenCode + Qwen3.5-122B-A10B-4bit on M3 Ultra created a great snake game! It worked zero-shot. The video is clearly sped up during generation. I generated the prompt using Grok 4.20; it's in the article.

Ivan Fioravanti ᯅ

74,659 views • 4 months ago

Hermes Agent 0.17 & MiniMax M3 played with the Unreal Engine 5.8 MCP for 45 mins (condensed into a 1-min video) and spent 10M tokens. Things are improving: we added Text and a Sphere! 🤣 But we failed to create and apply a texture. Note that I have nearly zero knowledge of Unreal Engine 🤷🏻‍♂️ Still experimental, and using a lot of tokens due to various tool-calling issues. But it was a lot of fun to play with! 🚀 I bet we'll see amazing things come out of this new UE 5.8 MCP!

Ivan Fioravanti ᯅ

18,854 views • 29 days ago

MiniMax M3 vs GLM-5.2 vs Kimi K2.7 in a Lunar Lander contest! 🔥 Last test of the day before going back to the learning & build phase with Local Models, Swift and Apple Core AI 💪 I took the opportunity to test the MiniMax Code app for macOS. The video below is sped up, followed by the Lunar Lander implementations from the various models. My ranking: 🥇 GLM 5.2 🥈 MiniMax M3 🥉 Kimi K2.7 Coding prompt and final results in this gist:

Ivan Fioravanti ᯅ

21,604 views • 1 month ago

Repo Prompt + o3 = Mind-blowing results! Plain video, no edits!

Ivan Fioravanti ᯅ

179,828 views • 1 year ago

Web Design Royal Rumble! Claude Code with: - GLM 4.7 (left) - Opus 4.5 (center) - MiniMax M2.1 (right) 🥇 M2.1 fastest 🥈 Opus 4.5 🥉 GLM 4.7 All great, which one is the best? 🤷🏻‍♂️

Ivan Fioravanti ᯅ

91,414 views • 6 months ago

For anyone wondering what it means to run ds4-agent locally on an M5 Max using the DeepSeek V4 Flash q2-imatrix gguf model: here's a video of ds4 updating itself, adding a way to leverage HF_HOME for gguf models. The future of Local AI is bright!

Ivan Fioravanti ᯅ

30,319 views • 1 month ago

Kimi K2.5 (Kimi CLI) vs MiniMax 2.1 (CC) vs GLM 4.7 (CC). 🔥 Same prompt to create a single-page website for "PHANTOM PROTOCOL", a fictional tactical shooter video game, 0-shot. Spoiler IMO: 🥇 Kimi K2.5 is in another league 🥈 MiniMax 2.1 🥉 GLM 4.7

Ivan Fioravanti ᯅ

72,529 views • 5 months ago

MiniMax M2.1 vs Opus 4.5 vs GLM-4.7: each built an interactive 3D solar system from scratch, running in 3 Claude Code instances in parallel. The winner is: 🥇 GLM-4.7! 🥈 Opus 4.5 (slowest) 🥉 M2.1 (fastest) GLM-4.7's design capabilities are on another level vs 4.6!

Ivan Fioravanti ᯅ

81,630 views • 6 months ago

Using Starlink and a MacBook in the middle of the mountains to play Blades in the Dark, with DeepSeek V4 Flash as GM running on ds4 on a remote M3 Ultra, exposed as a service through Tailscale. We are living in the best era of humankind! Hey... 👋🏻 not bad speed, eh?

Ivan Fioravanti ᯅ

12,842 views • 23 days ago

The latest oMLX version seems to have some issues with decoding. 🧐 Here's a test with Qwen3.6-35B-A3B-MLX-6bit on M5 Max: 🥇 LM Studio MLX 1.8.5 → 100.9 toks/s 🥈 mlx-vlm 0.6.2 → 100.1 toks/s 🥉 oMLX 0.4.2 dev3 → 58.7 toks/s 👀 Avg Gen TPS: oMLX 58.7 → LM Studio 100.9 (+71.8%) I have to thank pymike00, who raised this oMLX issue after seeing my video on using it with Codex. I bet there is a bug in the oMLX chat and server at the moment, because the internal benchmarks are OK (video attached). I bet Jun Kim will fix this soon 💪

Ivan Fioravanti ᯅ

20,060 views • 1 month ago

Can we run Hermes Agent in an Apple Container? 🤔 YES WE CAN! Here it is! 💪

Ivan Fioravanti ᯅ

13,385 views • 27 days ago

Local wireless Reachy Mini conversations look like magic! You can take your friend around the house and get a WOW effect from anyone! 🔥 Thanks to Andi Marafioti for the blog post on how to set this up: an open-source Realtime API powered by llama.cpp: Parakeet -> Gemma 4 E4B -> Qwen3TTS. Just perfect on M5! Pictures, audio and content remain local and private! Next step: adding some tools to search the web, or even better, interfacing with my Hermes Agent! 🔥

Ivan Fioravanti ᯅ

22,408 views • 1 month ago