Загрузка видео...

Не удалось загрузить видео

Возникла проблема при загрузке этого видео. Это может быть связано с временными проблемами сети или видео может быть недоступно.

На главную

Next mlx-vlm release will ship with continuous batching support on the server 🚀 What's coming: → Continuous batching — new requests join the active batch immediately, no waiting. Mixed image + text batches supported → OpenAI-compatible API — field-for-field match with mlx-lm, reasoning/content split for thinking models, tag-aware streaming... → Multi-turn tool calling — full tool use support across streaming and non-streaming, works with Gemma4 and other templates → Vision feature caching — cache image embeddings across turns. Gemma4: 228x speedup, Qwen3.5: 23x on cache hit All running locally on Apple Silicon. Check our this demo running 4 concurrent requests (mixed image + text) to gemma-4-26B-A4B-IT by Google Gemma in bf16 using Pi + MLX-VLM server on my M3 Ultra. One of the requests ingests a 8K resolution image!show more

Prince Canuma

22,254 subscribers

82,169 просмотров • 3 месяцев назад •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

🗓️ Release Week Recap Big week. mlx-audio and mlx-vlm are now among some of the fastest-growing OSS projects. Here’s what we shipped last week. Gemma 4 on Apple Silicon Two awesome releases by our partner Google DeepMind & Google Gemma : > Gemma 4 12B — their new dense, unified multimodal model. It uses an encoder free audio path and simplified vision encoder. > Gemma 4 QAT — quantization-aware training checkpoints, optimized to run locally on consumer GPUs and edge devices, compressing the model while preserving the quality you expect from Gemma 4. On the audio 🎧 side we added support for 15+ new TTS, ASR & VAD models, faster long-form transcription, and an expanded OpenAI-compatible audio server. All local on Apple Silicon. Huge thanks to every contributor and my co-maintainer Lucas Newman. 🙏🏽

🗓️ Release Week Recap Big week. mlx-audio and mlx-vlm are now among some of the fastest-growing OSS projects. Here’s what we shipped last week. Gemma 4 on Apple Silicon Two awesome releases by our partner Google DeepMind & Google Gemma : > Gemma 4 12B — their new dense, unified multimodal model. It uses an encoder free audio path and simplified vision encoder. > Gemma 4 QAT — quantization-aware training checkpoints, optimized to run locally on consumer GPUs and edge devices, compressing the model while preserving the quality you expect from Gemma 4. On the audio 🎧 side we added support for 15+ new TTS, ASR & VAD models, faster long-form transcription, and an expanded OpenAI-compatible audio server. All local on Apple Silicon. Huge thanks to every contributor and my co-maintainer Lucas Newman. 🙏🏽

Prince Canuma

11,731 просмотров • 1 месяц назад

Today we're shipping our biggest MLX-VLM release yet: v0.6.0 ...and we are raising 💸 This one's about turning your Apple devices into real local agent machines. From your desk to your pocket. What's new: ⚡ Speculative decoding everywhere — Gemma 4 EAGLE3 + DFlash, Qwen MTP, DeepSeek V4 MTP. Faster tokens, less waiting. 🤖 Agent-ready server — native Anthropic /v1/messages API, stateful /v1/responses, tool calls, Codex context budgets. Plug Claude Code & Codex straight into local models. 👁️ New models galore — DeepSeek V4, ZAYA1-VL, MiniCPM-V 4.6, LFM2 MoE, Step-3.7 Flash, Laguna + more. 🎨 Image gen & editing — FLUX.2 (base + klein), PrismML Bonsai. 🔊 Audio in — Qwen3 Omni, Gemma 4 audio, base64 chat audio. 🧮 TurboQuant KV cache — RHT-correct fast paths for leaner memory. 📦 Modular server, better metrics, cleaner streaming. Run real agents on the hardware already in your hands. Github:

Today we're shipping our biggest MLX-VLM release yet: v0.6.0 ...and we are raising 💸 This one's about turning your Apple devices into real local agent machines. From your desk to your pocket. What's new: ⚡ Speculative decoding everywhere — Gemma 4 EAGLE3 + DFlash, Qwen MTP, DeepSeek V4 MTP. Faster tokens, less waiting. 🤖 Agent-ready server — native Anthropic /v1/messages API, stateful /v1/responses, tool calls, Codex context budgets. Plug Claude Code & Codex straight into local models. 👁️ New models galore — DeepSeek V4, ZAYA1-VL, MiniCPM-V 4.6, LFM2 MoE, Step-3.7 Flash, Laguna + more. 🎨 Image gen & editing — FLUX.2 (base + klein), PrismML Bonsai. 🔊 Audio in — Qwen3 Omni, Gemma 4 audio, base64 chat audio. 🧮 TurboQuant KV cache — RHT-correct fast paths for leaner memory. 📦 Modular server, better metrics, cleaner streaming. Run real agents on the hardware already in your hands. Github:

Prince Canuma

65,678 просмотров • 1 месяц назад

QVAC SDK 0.11.0 is live. 🛠️ This release focuses entirely on unlocking next-generation local compute and advanced visual workflows. What’s new: Next-Gen Models: Core engine updated to the latest version of Fabric, unlocking full support for Qwen 3.5, Qwen 3.6, and Gemma 4. Multi-GPU Support: The SDK can now split workloads across multiple graphics cards on the same machine, allowing you to run significantly larger models completely locally. Multi-Image Conditioning: Blend multiple reference images together in a single generation for advanced style mixing and composition control. On-Device Upscaling: Boost your generated images to high-quality resolutions, running securely on your own hardware. More improvements are waiting under the hood. Check the change logs, update your SDK today, and start building with

QVAC SDK 0.11.0 is live. 🛠️ This release focuses entirely on unlocking next-generation local compute and advanced visual workflows. What’s new: Next-Gen Models: Core engine updated to the latest version of Fabric, unlocking full support for Qwen 3.5, Qwen 3.6, and Gemma 4. Multi-GPU Support: The SDK can now split workloads across multiple graphics cards on the same machine, allowing you to run significantly larger models completely locally. Multi-Image Conditioning: Blend multiple reference images together in a single generation for advanced style mixing and composition control. On-Device Upscaling: Boost your generated images to high-quality resolutions, running securely on your own hardware. More improvements are waiting under the hood. Check the change logs, update your SDK today, and start building with

QVAC

2,006,449 просмотров • 2 месяцев назад

Our first short course with Anthropic! Building Towards Computer Use with Anthropic. This teaches you to build an LLM-based agent that uses a computer interface by generating mouse clicks and keystrokes. Computer Use is an important, emerging capability for LLMs that will let AI agents do many more tasks than were possible before, since it lets them interact with interfaces designed for humans to use, rather than only tools that provide explicit API access. I hope you will enjoy learning about it! This course is taught by Anthropic's Head of Curriculum, Colt_Steele. You'll learn to apply image reasoning and tool use to "use" a computer as follows: a model processes an image of the screen, analyzes it to understand what's going on, and navigates the computer via mouse clicks and keystrokes. This course goes through the key building blocks, and culminates in a demo of an AI assistant that uses a web browser to search for a research paper, downloads the PDF, and finally summarizes the paper for you. In detail, you’ll: - Learn about Anthropic's family of models, when to use which one, and make API requests to Claude - Use multi-modal prompts that combine text and image content blocks, and also work with streaming responses - Improve your prompting by using prompt templates, using XML to structure prompts, and providing examples - Implement prompt caching to reduce cost and latency - Apply tool-use to build a chatbot that can call different tools to respond to queries - See all these building blocks come together in Computer Use demo Please sign up here:

Our first short course with Anthropic! Building Towards Computer Use with Anthropic. This teaches you to build an LLM-based agent that uses a computer interface by generating mouse clicks and keystrokes. Computer Use is an important, emerging capability for LLMs that will let AI agents do many more tasks than were possible before, since it lets them interact with interfaces designed for humans to use, rather than only tools that provide explicit API access. I hope you will enjoy learning about it! This course is taught by Anthropic's Head of Curriculum, Colt_Steele. You'll learn to apply image reasoning and tool use to "use" a computer as follows: a model processes an image of the screen, analyzes it to understand what's going on, and navigates the computer via mouse clicks and keystrokes. This course goes through the key building blocks, and culminates in a demo of an AI assistant that uses a web browser to search for a research paper, downloads the PDF, and finally summarizes the paper for you. In detail, you’ll: - Learn about Anthropic's family of models, when to use which one, and make API requests to Claude - Use multi-modal prompts that combine text and image content blocks, and also work with streaming responses - Improve your prompting by using prompt templates, using XML to structure prompts, and providing examples - Implement prompt caching to reduce cost and latency - Apply tool-use to build a chatbot that can call different tools to respond to queries - See all these building blocks come together in Computer Use demo Please sign up here:

Andrew Ng

170,404 просмотров • 1 год назад

QVAC SDK 0.12.0 is now live, bringing longer context, increased memory optimisation, new modalities, and broader ecosystem support directly to your device. Key Features and Updates: - TurboQuant KV-Cache Quantization: Fit much longer context in the same memory. TurboQuant, an algorithm from Google Research, compresses the KV cache by up to 5x, near-lossless. - Text-to-Video: Generate video from a text prompt, fully local, with the new wan2.1 model in the Diffusion addon - Apple Metal Performance for Flux2-klein: Diffusion on Apple Silicon now matches MLX performance, the native benchmark for Apple GPUs - Robot Control (new VLA addon): A GGML-based Vision-Language-Action addon brings fast, efficient robot control to edge devices - Coding Assistant / Harness Support: QVAC now works with OpenCode and OpenClaw as a local provider. A new @qvac/ai-sdk-provider package automates model registry and provider integration - Cross-Platform Voice: Text-to-speech and Parakeet transcription moved from ONNX to the GGML engine for better CPU and GPU support on macOS, iOS, Windows, Linux, and Android. Parakeet also adds long-term streaming diarization (tracking who spoke when on live audio) - Faster Lightweight Visual Classification: A new GGML-based Classification addon delivers millisecond-level classification, useful where a vision-language model (VLM) would be unnecessarily slow - Under the Hood: Fabric synced to llama.cpp v8828 (from v8189), plus GPU acceleration added to image-upscale models for faster results Full release notes:

QVAC SDK 0.12.0 is now live, bringing longer context, increased memory optimisation, new modalities, and broader ecosystem support directly to your device. Key Features and Updates: - TurboQuant KV-Cache Quantization: Fit much longer context in the same memory. TurboQuant, an algorithm from Google Research, compresses the KV cache by up to 5x, near-lossless. - Text-to-Video: Generate video from a text prompt, fully local, with the new wan2.1 model in the Diffusion addon - Apple Metal Performance for Flux2-klein: Diffusion on Apple Silicon now matches MLX performance, the native benchmark for Apple GPUs - Robot Control (new VLA addon): A GGML-based Vision-Language-Action addon brings fast, efficient robot control to edge devices - Coding Assistant / Harness Support: QVAC now works with OpenCode and OpenClaw as a local provider. A new @qvac/ai-sdk-provider package automates model registry and provider integration - Cross-Platform Voice: Text-to-speech and Parakeet transcription moved from ONNX to the GGML engine for better CPU and GPU support on macOS, iOS, Windows, Linux, and Android. Parakeet also adds long-term streaming diarization (tracking who spoke when on live audio) - Faster Lightweight Visual Classification: A new GGML-based Classification addon delivers millisecond-level classification, useful where a vision-language model (VLM) would be unnecessarily slow - Under the Hood: Fabric synced to llama.cpp v8828 (from v8189), plus GPU acceleration added to image-upscale models for faster results Full release notes:

QVAC

9,932,369 просмотров • 1 месяц назад

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC SDK 0.14.0 is live. This release makes the on-device stack faster on mobile, ships the developer-agent path, and takes local text-to-speech to 31 languages. Main highlights: - OpenCode and OpenClaw. The first official OpenCode plugin, plus a maintained OpenClaw compatibility path, both built on managed mode and qvac serve. Point a coding agent at a local model with far less setup and far fewer surprises. - Brain-computer interface transcription, on the SDK. Take recorded neural signal data and decode it into text, fully on-device, no cloud. Stream it in chunks through a simple API. In 0.14 it runs GPU-accelerated on iOS. - Text to Speech in 31 languages with our Supertonic3 upgrade. VOICE AND SPEECH - Supertonic3 multilingual TTS, 5 languages to 31. - Chatterbox and Supertonic now run on the Android GPU, with lower memory use (especially on iOS), quantized s3gen Chatterbox support, and a fix for Chatterbox occasionally emitting random speech. - Whisper transcription now runs on the iOS GPU. Parakeet runs on the Android GPU, with steadier real-time streaming. VISION AND OCR - VLM multi-tile batching: high-resolution Pan and Scan images are encoded in one pass instead of tile by tile, for faster vision throughput. - OCR on ggml (EasyOCR and DocTR) reaches full speed parity with the onnx path, across Metal, OpenCL, and Vulkan. PLATFORM AND RELIABILITY - Dynamic compute backends on Linux: one build picks the right backend at runtime, and opens the door to ROCm and CUDA support without per-backend builds. - Thinking tokens are kept out of the model context, so reasoning no longer fills the KV cache. SDK 0.14.0 is now leaner and faster to start. Let’s build.

QVAC

23,973,950 просмотров • 24 дней назад

QVAC SDK 0.13.0 is live, and this version brings a lot of exciting updates! Local AI now plugs into your coding agent, ships as a desktop app in one command, and runs even more models. Highlights: NEW INTEGRATIONS - OpenCode and coding agents: the new @qvac/ai-sdk-provider makes QVAC a local provider. Less setup, same-model requests queue cleanly, and managed mode starts and supervises qvac serve for you. - Broader OpenAI-compatible API, validated across supported flows so covered capabilities stay consistent and testable. - Turn your QVAC project into a real desktop app for Mac, Windows, or Linux with a single command. The new Electron plugin handles the packaging and keeps the app small by including only what it needs. NEW MODELS - New pi0.5 model support - run a vision-language "robot brain" on a single ordinary graphics card, at full accuracy. - Image-to-video, fully local, via the Wan2.1 model in the Diffusion addon. - New BCI add-on: brain-computer interface transcription, fully local. Decode recorded neural signals into text on-device via the Whisper.cpp-based BCI model. IMPROVEMENTS - Whisper GPU transcription on Android, auto-picking the best backend (OpenCL on Adreno 700+, Vulkan elsewhere), unified on one ggml engine. - Parakeet steadier on mobile, with real end-of-utterance detection for streaming. - Supertonic TTS now runs full GPU across Metal, Vulkan, and OpenCL, with native streaming.

QVAC SDK 0.13.0 is live, and this version brings a lot of exciting updates! Local AI now plugs into your coding agent, ships as a desktop app in one command, and runs even more models. Highlights: NEW INTEGRATIONS - OpenCode and coding agents: the new @qvac/ai-sdk-provider makes QVAC a local provider. Less setup, same-model requests queue cleanly, and managed mode starts and supervises qvac serve for you. - Broader OpenAI-compatible API, validated across supported flows so covered capabilities stay consistent and testable. - Turn your QVAC project into a real desktop app for Mac, Windows, or Linux with a single command. The new Electron plugin handles the packaging and keeps the app small by including only what it needs. NEW MODELS - New pi0.5 model support - run a vision-language "robot brain" on a single ordinary graphics card, at full accuracy. - Image-to-video, fully local, via the Wan2.1 model in the Diffusion addon. - New BCI add-on: brain-computer interface transcription, fully local. Decode recorded neural signals into text on-device via the Whisper.cpp-based BCI model. IMPROVEMENTS - Whisper GPU transcription on Android, auto-picking the best backend (OpenCL on Adreno 700+, Vulkan elsewhere), unified on one ggml engine. - Parakeet steadier on mobile, with real end-of-utterance detection for streaming. - Supertonic TTS now runs full GPU across Metal, Vulkan, and OpenCL, with native streaming.

QVAC

20,922,918 просмотров • 1 месяц назад

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

[CLIP] by Hand ✍️ The CLIP (Contrastive Language–Image Pre-training) model, a groundbreaking work by OpenAI, redefines the intersection of computer vision and natural language processing. It is the basis of all the multi-modal foundation models we see today. How does CLIP work? Goal: 🟨 Learn a shared embedding space for text and image [1] Given ↳ A mini batch of 3 text-image pairs ↳ OpenAI used 400 million text-image pairs to train its original CLIP model. Process 1st pair: "big table" [2] 🟪 Text → 2 Vectors (3D) ↳ Look up word embedding vectors using word2vec. [3] 🟩 Image → 2 Vectors (4D) ↳ Divide the image into two patches. ↳ Flatten each patch [4] Process other pairs ↳ Repeat [2]-[3] [5] 🟪 Text Encoder & 🟩 Image Encoder ↳ Encode input vectors into feature vectors ↳ Here, both encoders are simple one layer perceptron (linear + ReLU) ↳ In practice, the encoders are usually transformer models. [6] 🟪 🟩 Mean Pooling: 2 → 1 vector ↳ Average 2 feature vectors into a single vector by averaging across the columns ↳ The goal is to have one vector to represent each image or text [7] 🟪 🟩 -> 🟨 Projection ↳ Note that the text and image feature vectors from the encoders have different dimensions (3D vs. 4D). ↳ Use a linear layer to project image and text vectors to a 2D shared embedding space. 🏋️ Contrastive Pre-training 🏋️ [8] Prepare for MatMul ↳ Copy text vectors (T1,T2,T3) ↳ Copy the transpose of image vectors (I1,I2,I3) ↳ They are all in the 2D shared embedding space. [9] 🟦 MatMul ↳ Multiply T and I matrices. ↳ This is equivalent to taking dot product between every pair of image and text vectors. ↳ The purpose is to use dot product to estimate the similarity between a pair of image-text. [10] 🟦 Softmax: e^x ↳ Raise e to the power of the number in each cell ↳ To simplify hand calculation, we approximate e^□ with 3^□. [11] 🟦 Softmax: ∑ ↳ Sum each row for 🟩 image→🟪 text ↳ Sum each column for 🟪 text→ 🟩 image [12] 🟦 Softmax: 1 / sum ↳ Divide each element by the column sum to obtain a similarity matrix for 🟪 text→🟩 image ↳ Divide each element by the row sum to obtain a similarity matrix for 🟩 image→🟪 text [13] 🟥 Loss Gradients ↳ The "Targets" for the similarity matrices are Identity Matrices. ↳ Why? If I and T come from the same pair (i=j), we want the highest value, which is 1, and 0 otherwise. ↳ Apply the simple equation of [Similarity - Target] to compute gradients of for both directions. ↳ Why so simple? Because when Softmax and Cross-Entropy Loss are used together, the math magically works out that way. ↳ These gradients kick off the backpropagation process to update weights and biases of the encoders and projection layers (red borders).

Tom Yeh

67,834 просмотров • 2 лет назад

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗

🎉 Congrats to MiniMax (official) on releasing MiniMax M3! Frontier coding and agentic capabilities, native image and video input, computer use, and a 1M-token context window, all in a single open model. At the heart of M3 is MSA, a new sparse attention architecture: instead of attending densely over the full KV cache, each query scores 128-token KV blocks and runs attention only over the top blocks. That is what makes 1M-token context practical to serve. M3 runs in vLLM with day-0 support, verified on NVIDIA and AMD hardware: ✨ MSA sparse attention with dedicated prefill and decode kernels ✨ 1M-token context serving with prefix caching and chunked prefill ✨ BF16 and MXFP8 checkpoints, with MoE backends for both Hopper and Blackwell ✨ Native multimodal input (image + video) ✨ Tool calling, reasoning parsing, and thinking-mode control for agent workloads Day-0 support like this is a true team effort. Grateful to the teams at MiniMax (official), NVIDIA AI, AI at AMD, and Inferact, and to the vLLM community for making it happen. 🙏 Deep dive into the implementation, kernel work, and deployment recipes: 🔗

vLLM

40,306 просмотров • 1 месяц назад

Check out mistral.rs, our #Rust-based open source inference engine allowing for fast #LLM serving for a variety of architectures including X-LoRA mixture-of-expert (MoE) models, Llama-3, Mistral/Mixtral, Gemma & many others. Built on the Hugging Face #Candle framework for #Rust w/ custom CUDA kernels in the backend (as well as support for Metal, Apple Accelerate, and Intel MKL for CPU use), you can easily create a REST API OpenAI compatible server or run via Python bindings. Key features include: ✅Prefix caching, continuous batching ✅Flash Attention V2 ✅Device offloading ✅GGUF or Hugging Face models ✅2, 3, 4, 5, 6 and 8 bit quantization ✅X-LoRA MoE non-granular scalings for fast inference ✅Grammar support ✅Continuous batching ✅LoRA support with weight merging ✅LlamaIndex 🦙 integration ...and much more. Incorporation into our GraphReasoning multi-agent modeling framework & LlamaIndex 🦙 allows you to combine in-context learning with adversarial agentic strategies, to dive deep into complex scientific analyses, such as to predict material behaviors, generate hypotheses, analyze papers and data, develop new research concepts, and much more. Check out mistral.rs: Join our Discord here: Rust Trending Rust Language

Check out mistral.rs, our #Rust-based open source inference engine allowing for fast #LLM serving for a variety of architectures including X-LoRA mixture-of-expert (MoE) models, Llama-3, Mistral/Mixtral, Gemma & many others. Built on the Hugging Face #Candle framework for #Rust w/ custom CUDA kernels in the backend (as well as support for Metal, Apple Accelerate, and Intel MKL for CPU use), you can easily create a REST API OpenAI compatible server or run via Python bindings. Key features include: ✅Prefix caching, continuous batching ✅Flash Attention V2 ✅Device offloading ✅GGUF or Hugging Face models ✅2, 3, 4, 5, 6 and 8 bit quantization ✅X-LoRA MoE non-granular scalings for fast inference ✅Grammar support ✅Continuous batching ✅LoRA support with weight merging ✅LlamaIndex 🦙 integration ...and much more. Incorporation into our GraphReasoning multi-agent modeling framework & LlamaIndex 🦙 allows you to combine in-context learning with adversarial agentic strategies, to dive deep into complex scientific analyses, such as to predict material behaviors, generate hypotheses, analyze papers and data, develop new research concepts, and much more. Check out mistral.rs: Join our Discord here: Rust Trending Rust Language

Markus J. Buehler

73,581 просмотров • 2 лет назад

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

A single RTX 4090 (24 GB VRAM) can run the updated gemma 4 31B (dense) model with a 190,000 context window at 33 tokens/second. The VRAM barrier is dying. Google quietly updated Gemma 4, and Unsloth immediately compiled the new quants. I built llama.cpp from source on Ubuntu 22 to benchmark it. Google's stealth update 2 days ago enabled uniform Flash Attention 4 on Hopper to boost prefill and patched the chat template to improve tool calling. The agentic reasoning gains on the benchmark charts are massive: TB2 (Agents): +4.5% (to 25.8%) Tau2 (Telecom): +10.1% (to 62.7%) Running on Ubuntu 22, CUDA 13.0 with a single NVIDIA GeForce RTX 4090. Here is the exact step by step benchmarking process with a massive 28k tokens prompt and the commands I used to squeeze out maximum context without killing my throughput: # 1. The Baseline (Unquantized KV Cache) I started with full GPU offload (-ngl 99) and pushed the context to 40k. llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -ngl 99 -c 40000 -fa on --port 8080 -v VRAM: 23.8 GB (maxed out on card) Throughput: Prefill: 2198.81 t/s | Decode: 35.77 t/s (with 28k tokens prompt) # 2. The CPU Split Trap I tried stretching to 80k context by offloading layers to the CPU (-ngl 52). llama.cpp flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 80000 -ngl 52 -fa on --port 8080 -v Throughput: Prefill: 1212.73 t/s | Decode: 5 t/s (with 28k tokens prompt) # 3. The KV Quantization Breakthrough Instead of spilling layers to the CPU, I kept the model fully on card (-ngl 99) but enabled 8-bit KV cache quantization to free up VRAM. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 100000 --cache-type-k q8_0 --cache-type-v q8_0 -ngl 99 --port 8080 -v VRAM: 23.9 GB Throughput: Prefill: 2139.68 t/s | Decode: 32 t/s (with 28k tokens prompt) Result: 100k tokens of context on a single GPU with practically zero speed loss (and minimal intelligence loss). # 4. The Limit Test (Q4 KV Cache) To find the absolute breaking point, I dropped the KV cache to 4 bit (q4_0) and set -c 190000. flags: ./build/bin/llama-server -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -c 190000 --cache-type-k q4_0 --cache-type-v q4_0 -ngl 99 --port 8080 -v VRAM: 23.8 GB Throughput: Prefill: 2206.66 t/s | Decode: 33 t/s (with 28k tokens prompt) (Note: Pushing it to 220k required dropping to -ngl 58 again, which immediately penalized decode down to 17 t/s). # The Tradeoff: For Max Reasoning: Keep your KV cache unquantized (f16). You get pristine reasoning but hit a strict 40k context ceiling. For Massive Document Retrieval: If you need to feed the model giant codebases, use --cache-type-k q4_0. Getting 190k context at 33 tokens/second on a consumer desktop with a 31b dense model is a cheat code. If you’re rocking a single 3090 or 4090 and slept on Gemma 4 earlier, this update is your cue to dust off the terminal. Hugging Face links to the Unsloth QAT quants are in the replies below.

Alok

74,579 просмотров • 5 дней назад

Andrej Karpathy (co-founder of OpenAI, Eureka Labs): "The future of AI is not all in the cloud. The most important models will run locally on your hardware under your control." June 2026 is the first time that the future is actually accessible to everyone! Here is what the local LLM landscape looks like right now: → raspberry pi 5: coherent chatbot, no GPU required → macbook air: matches GPT-3.5 quality on most tasks natively → used RTX 3090 at $700: something close to GPT-4 running locally → ollama: one command to an openAI-compatible endpoint, every major toolchain integrates with it → LM studio: best GUI, native MLX acceleration on apple silicon, MCP support in 0.4.0 → llama.cpp: the engine everything else runs on, 90MB footprint, runs on raspberry pi to android → the honest case: 80% of daily work is genuinely good enough locally right now Your prompts never leave your machine. no per-token billing. no rate limits. no refusals. The playbook for every hardware tier, from Raspberry Pi to RTX 5090, is here. Bookmark it so you do not lose it! Follow NeilXbt for more local AI and open source intelligence that tracks the hardware and tooling most people discover six months late.

Andrej Karpathy (co-founder of OpenAI, Eureka Labs): "The future of AI is not all in the cloud. The most important models will run locally on your hardware under your control." June 2026 is the first time that the future is actually accessible to everyone! Here is what the local LLM landscape looks like right now: → raspberry pi 5: coherent chatbot, no GPU required → macbook air: matches GPT-3.5 quality on most tasks natively → used RTX 3090 at $700: something close to GPT-4 running locally → ollama: one command to an openAI-compatible endpoint, every major toolchain integrates with it → LM studio: best GUI, native MLX acceleration on apple silicon, MCP support in 0.4.0 → llama.cpp: the engine everything else runs on, 90MB footprint, runs on raspberry pi to android → the honest case: 80% of daily work is genuinely good enough locally right now Your prompts never leave your machine. no per-token billing. no rate limits. no refusals. The playbook for every hardware tier, from Raspberry Pi to RTX 5090, is here. Bookmark it so you do not lose it! Follow NeilXbt for more local AI and open source intelligence that tracks the hardware and tooling most people discover six months late.

NeilXbt

19,405 просмотров • 1 месяц назад

QVAC SDK 0.15.0 is live. This release adds multiple prompts batching, brings a native AMD GPU backend to the stack, moves more vision encoders onto mobile GPUs, and adds a second local coding-agent integration. Main highlights: - Prompt batching for the LLM addon. Batch multiple prompts into one job and process them concurrently, with each answer returned the moment its generation finishes. - Native AMD GPU backend. A first-class HIP/ROCm backend in @qvac/vla-ggml, auto-selected over Vulkan with clean fallback when ROCm is absent. - A second local coding agent. OpenClaw joins OpenCode for local, cloud-free agent workflows. AGENTS - OpenCode plugin update (@qvac/opencode-plugin). Aligned with the current SDK, CLI, and AI SDK provider packages. A fresh install runs OpenCode against managed local QVAC models out of the box, from the default qvac/qwen3.5-9b, with no manual qvac serve setup. - OpenClaw plugin (@qvac/openclaw-plugin). A second coding-agent integration alongside OpenCode. A fresh setup installs the plugin, creates a local qvac provider through onboarding, and runs a QVAC model through OpenClaw🦞's local service path. LANGUAGE MODELS - Prompt batching (LLM addon). Batch multiple prompts in one job and run them concurrently, each answer returns the moment its generation finishes, no waiting on the others. - Reasoning-context trimming on hybrid + recurrent models (@qvac/llm-llamacpp). remove_thinking_from_context now works beyond pure-attention models. Same JS API, no throw. VOICE AND SPEECH - Transcription (transcription-parakeet 0.9.0). More robust CPU fallback on GPU failure and a faster Vulkan backend on Pixel 9. - Text-to-speech features (tts-ggml 0.4.0). Adds LavaSR for noise removal and adjustable output frequency up to 48 kHz, plus Japanese via Chatterbox. - Text-to-speech fixes (tts-ggml 0.4.1). CPU fallback on GPU failure, a q8_0 KV crash fix on Metal with Chatterbox. VISION - Qwen3.5 vision encoder on GPU (Android). Image encoder moves onto the phone GPU, with a smarter tile-grid preprocessor and default image-token caps, for flagship Android: Vulkan on Mali (Pixel 9 Pro) and OpenCL on Adreno 830 (Galaxy S25). - Gemma-4 vision encoder on GPU (Android). Vision encoder runs on the phone GPU instead of CPU, same flagship Android targets. PLATFORM AND PERFORMANCE - AMD GPU backend (@qvac/vla-ggml). Native HIP/ROCm backend, auto-selected over Vulkan with clean fallback when ROCm is absent (Linux x64 only). Comes with ~23% faster than Vulkan, ~14% faster than PyTorch-ROCm, parity preserved. Unified code style. A cleaner, more consistent, easier-to-contribute codebase. Let's build. npm install @qvac/sdk

QVAC SDK 0.15.0 is live. This release adds multiple prompts batching, brings a native AMD GPU backend to the stack, moves more vision encoders onto mobile GPUs, and adds a second local coding-agent integration. Main highlights: - Prompt batching for the LLM addon. Batch multiple prompts into one job and process them concurrently, with each answer returned the moment its generation finishes. - Native AMD GPU backend. A first-class HIP/ROCm backend in @qvac/vla-ggml, auto-selected over Vulkan with clean fallback when ROCm is absent. - A second local coding agent. OpenClaw joins OpenCode for local, cloud-free agent workflows. AGENTS - OpenCode plugin update (@qvac/opencode-plugin). Aligned with the current SDK, CLI, and AI SDK provider packages. A fresh install runs OpenCode against managed local QVAC models out of the box, from the default qvac/qwen3.5-9b, with no manual qvac serve setup. - OpenClaw plugin (@qvac/openclaw-plugin). A second coding-agent integration alongside OpenCode. A fresh setup installs the plugin, creates a local qvac provider through onboarding, and runs a QVAC model through OpenClaw🦞's local service path. LANGUAGE MODELS - Prompt batching (LLM addon). Batch multiple prompts in one job and run them concurrently, each answer returns the moment its generation finishes, no waiting on the others. - Reasoning-context trimming on hybrid + recurrent models (@qvac/llm-llamacpp). remove_thinking_from_context now works beyond pure-attention models. Same JS API, no throw. VOICE AND SPEECH - Transcription (transcription-parakeet 0.9.0). More robust CPU fallback on GPU failure and a faster Vulkan backend on Pixel 9. - Text-to-speech features (tts-ggml 0.4.0). Adds LavaSR for noise removal and adjustable output frequency up to 48 kHz, plus Japanese via Chatterbox. - Text-to-speech fixes (tts-ggml 0.4.1). CPU fallback on GPU failure, a q8_0 KV crash fix on Metal with Chatterbox. VISION - Qwen3.5 vision encoder on GPU (Android). Image encoder moves onto the phone GPU, with a smarter tile-grid preprocessor and default image-token caps, for flagship Android: Vulkan on Mali (Pixel 9 Pro) and OpenCL on Adreno 830 (Galaxy S25). - Gemma-4 vision encoder on GPU (Android). Vision encoder runs on the phone GPU instead of CPU, same flagship Android targets. PLATFORM AND PERFORMANCE - AMD GPU backend (@qvac/vla-ggml). Native HIP/ROCm backend, auto-selected over Vulkan with clean fallback when ROCm is absent (Linux x64 only). Comes with ~23% faster than Vulkan, ~14% faster than PyTorch-ROCm, parity preserved. Unified code style. A cleaner, more consistent, easier-to-contribute codebase. Let's build. npm install @qvac/sdk

QVAC

29,245,702 просмотров • 10 дней назад

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Introducing "Building with Llama 4." This short course is created with Meta AI at Meta, and taught by Amit Sangani, Director of Partner Engineering for Meta’s AI team. Meta’s new Llama 4 has added three new models and introduced the Mixture-of-Experts (MoE) architecture to its family of open-weight models, making them more efficient to serve. In this course, you’ll work with two of the three new models introduced in Llama 4. First is Maverick, a 400B parameter model, with 128 experts and 17B active parameters. Second is Scout, a 109B parameter model with 16 experts and 17B active parameters. Maverick and Scout support long context windows of up to a million tokens and 10M tokens, respectively. The latter is enough to support directly inputting even fairly large GitHub repos for analysis! In hands-on lessons, you’ll build apps using Llama 4’s new multimodal capabilities including reasoning across multiple images and image grounding, in which you can identify elements in images. You’ll also use the official Llama API, work with Llama 4’s long-context abilities, and learn about Llama’s newest open-source tools: its prompt optimization tool that automatically improves system prompts and synthetic data kit that generates high-quality datasets for fine-tuning. If you need an open model, Llama is a great option, and the Llama 4 family is an important part of any GenAI developer's toolkit. Through this course, you’ll learn to call Llama 4 via API, use its optimization tools, and build features that span text, images, and large context. Please sign up here:

Andrew Ng

67,710 просмотров • 1 год назад

This is probably the most complex workflow I’ve ever built, only with open-source tools. It took my 4 days. It takes four inputs: author, title, and style; and generates a full visual animated story in one click in ComfyUI . I worked on it for four days. There are still some bugs, but here’s the first preview. Here’s a quick breakdown: - The four inputs are sent to LLMs with precise instructions to generate: first, prompts for images and image modifications; second, prompts for animations; third, prompts for generating music. - All voices are generated from the text and timed precisely, as they determine the length of each animation segment. - The first image and video are generated to serve as the title, but also as the guide for all other images created for the video. - Titles and subtitles are also added automatically in Comfy. - I also developed a lot of custom nodes for minor frame calculations, mostly to match audio and video. - The full system is a large loop that, for each line of text, generates an image and then a video from that image. The loop was the hardest part to build in this workflow, so it can process either a 20-second video or a 2-minute video with the same input. - There are multiple combinations of LLMs that try to understand the text in the best way to provide the best prompts for images and video. - The final video is assembled entirely within ComfyUI. - The music is generated based on the LLM output and matches the exact timing of the full animation. - Done! For reference, this workflow uses a lot of models and only works on an RTX 6000 Pro with plenty of RAM. My goal is not to replace humans, as I’ll try to explain later, this workflow is highly controlled and can be adapted or reworked at any point by real artists! My aim was to create a tool that can animate text in one go, allowing the AI some freedom while keeping a strict flow. I don’t know yet how I’ll share this workflow with people, I still need to polish it properly, but maybe through Patreon. Anyway, I hope you enjoy my research, and let’s always keep pushing further! :)

This is probably the most complex workflow I’ve ever built, only with open-source tools. It took my 4 days. It takes four inputs: author, title, and style; and generates a full visual animated story in one click in ComfyUI . I worked on it for four days. There are still some bugs, but here’s the first preview. Here’s a quick breakdown: - The four inputs are sent to LLMs with precise instructions to generate: first, prompts for images and image modifications; second, prompts for animations; third, prompts for generating music. - All voices are generated from the text and timed precisely, as they determine the length of each animation segment. - The first image and video are generated to serve as the title, but also as the guide for all other images created for the video. - Titles and subtitles are also added automatically in Comfy. - I also developed a lot of custom nodes for minor frame calculations, mostly to match audio and video. - The full system is a large loop that, for each line of text, generates an image and then a video from that image. The loop was the hardest part to build in this workflow, so it can process either a 20-second video or a 2-minute video with the same input. - There are multiple combinations of LLMs that try to understand the text in the best way to provide the best prompts for images and video. - The final video is assembled entirely within ComfyUI. - The music is generated based on the LLM output and matches the exact timing of the full animation. - Done! For reference, this workflow uses a lot of models and only works on an RTX 6000 Pro with plenty of RAM. My goal is not to replace humans, as I’ll try to explain later, this workflow is highly controlled and can be adapted or reworked at any point by real artists! My aim was to create a tool that can animate text in one go, allowing the AI some freedom while keeping a strict flow. I don’t know yet how I’ll share this workflow with people, I still need to polish it properly, but maybe through Patreon. Anyway, I hope you enjoy my research, and let’s always keep pushing further! :)

Lovis Odin

58,769 просмотров • 10 месяцев назад

All these demo videos make HEAD SWAPPING with Nano Banana look so easy, but then you give it a try and you're like... uh... what? Why didn't that work? Here's what I've found. Nano Banana reads your image, almost literally, so if you write on the image, it reads the text. This is how Higgsfield AI 🧩 has capitalized on the tech: "Write on the image" and give it direction, right? Totally true, but you don't need Higgi to write on your image. Nano Banana will understand your direction regardless of where you write on your image. On one hand, Higgi is really smart, because they're hranessing the tech in a unique way, but the whole "Higgsfield's Banana Placement" is a bit of a misnomer. It's more of a "Banana Placement" and Higgi is just giving you a sort of basic Photoshop-type tool to work with (again, pretty smart), but the real tech is the Banana. 🍌 This is how I head swapped heads in Runway, but Nano Banana maintains the aesthetic qualities of your image almost perfectly, whereas Runway Reference spits out a very Gen-4 looking image. I like using Nano in Freepik (now Magnific), mainly because it's fast and I can get 4 gens at a time, and you need to gen a dozen times of so before you get a winner (most of the time). I was pumped when I saw Freepik introduce the @ reference feature, just like Runway has, but it doesn't seem to work for head swapping. My guess is because that's not really how Nano Banana tech works... ideally. Marco is the person I saw using this "A" and "B" method, back when Nano was on LM Arena, and man-oh-man, it just works... like a charm. You need experiment with how much of the face you blot out, and the angle and facial expression of your new head if you want the blend to be perfect. All of the results in this video are 100% Nano Banana. I did not do any Photoshop work to the images after the fact. I really hope this helps. Let me know if you have any questions. I'm happy to help. And I'll keep posting videos like this if you guys find them useful. Let me know! And if you want more serious, one-on-one AI consultation you can throw something on the books here:

All these demo videos make HEAD SWAPPING with Nano Banana look so easy, but then you give it a try and you're like... uh... what? Why didn't that work? Here's what I've found. Nano Banana reads your image, almost literally, so if you write on the image, it reads the text. This is how Higgsfield AI 🧩 has capitalized on the tech: "Write on the image" and give it direction, right? Totally true, but you don't need Higgi to write on your image. Nano Banana will understand your direction regardless of where you write on your image. On one hand, Higgi is really smart, because they're hranessing the tech in a unique way, but the whole "Higgsfield's Banana Placement" is a bit of a misnomer. It's more of a "Banana Placement" and Higgi is just giving you a sort of basic Photoshop-type tool to work with (again, pretty smart), but the real tech is the Banana. 🍌 This is how I head swapped heads in Runway, but Nano Banana maintains the aesthetic qualities of your image almost perfectly, whereas Runway Reference spits out a very Gen-4 looking image. I like using Nano in Freepik (now Magnific), mainly because it's fast and I can get 4 gens at a time, and you need to gen a dozen times of so before you get a winner (most of the time). I was pumped when I saw Freepik introduce the @ reference feature, just like Runway has, but it doesn't seem to work for head swapping. My guess is because that's not really how Nano Banana tech works... ideally. Marco is the person I saw using this "A" and "B" method, back when Nano was on LM Arena, and man-oh-man, it just works... like a charm. You need experiment with how much of the face you blot out, and the angle and facial expression of your new head if you want the blend to be perfect. All of the results in this video are 100% Nano Banana. I did not do any Photoshop work to the images after the fact. I really hope this helps. Let me know if you have any questions. I'm happy to help. And I'll keep posting videos like this if you guys find them useful. Let me know! And if you want more serious, one-on-one AI consultation you can throw something on the books here:

Jordan Daniel Chesney

61,803 просмотров • 10 месяцев назад

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-40 Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

We benchmarked leading multimodal foundation models (GPT-4o, Claude 3.5 Sonnet, Gemini, Llama, etc.) on standard computer vision tasks—from segmentation to surface normal estimation—using standard datasets like COCO and ImageNet. These models have made remarkable progress; however, it is unclear exactly where they stand in terms of understanding vision in detail. Especially when it comes to tasks beyond question-answering. How well do they understand an object's segments or geometry? Our analyses yield an assessment that is quantitatively and qualitatively detailed and is compatible with evaluations developed in the field of computer vision over the past decades. Observed trends: 🔹 The foundation models consistently underperform task-specific SOTA models across all tasks. However, they are respectable generalists, which is remarkable as they are presumably trained primarily on image-text-based tasks. 🔹 They perform semantic tasks notably better than geometric ones. 🔹 GPT-4o performs the best among non-reasoning models, getting the top position in 4 out of 6 tasks. 🔹 Reasoning models, e.g., o3, show improvements in geometric tasks. 🔹 The 'image generation' models, e.g., GPT-40 Image Generation, which have been natively trained multimodally, exhibit quirks. E.g., hallucinated objects, misalignment between the input and output, etc. 🔹 While the prompting techniques affect performance, better models exhibit less sensitivity to variations in prompts. We control for the variance introduced by the prompting methods in our experiments. 🌐 Detailed analyses, visualizations: ⌨️ code: 🧵 1/n

Amir Zamir

73,074 просмотров • 1 год назад

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

six months ago this wasn't happening on 8gb vram. running unsloth's Q4_K_XL quant of gemma 4 26b-a4b-it-qat, a sparse MoE model with only 4b active params on a single rtx 4060 laptop gpu, 8gb vram, 20+ tok/s decode. no cloud, no api, no offload hacks. just a gaming laptop on battery. what makes it fit: google's QAT (quantization aware training), plus MTP (multi token prediction) support in the latest llama.cpp builds. that combo is the single biggest unlock for local inference on low vram. rtx 3060, rtx 3070, gtx 1070, gtx 1080, rtx 4050, rtx 4060, rtx 5050, rtx 5060 — any 6-8gb consumer gpu, old or new — this model runs on it. world cup season, so i told it to build a soccer themed flappy bird clone. one shot, zero iteration, fully playable. six months ago an 8gb model could barely clone vanilla flappy bird. now it's shipping a themed game from a sparse MoE model running locally on a laptop battery. inference benchmarks: - decode throughput: 30 tok/s - context: 64k. this is the real unlock. 64k ctx is what makes a hermes agent loop viable locally on this model, not just single-turn chat. llama.cpp flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -c 64000 -cmoe --port 8080 game's deployed on my own site, built and shipped end to end with open source llm, zero closed source api dependency in the pipeline. link in the description. gguf weights on huggingface, link in the comments. pull it down, run it on whatever 8gb card is sitting in your rig. try the game and tell me your score and what you want in v2. local llms on consumer gpus stopped being a meme.

Alok

60,866 просмотров • 1 месяц назад

🔥HOLY SMOKES! $TAO holders! 🚀 SUBNET 19 (VISION) ON BITTENSOR IS ABSOLUTELY CRUSHING IT! In my 5+ years covering crypto and AI, this is one of the most impressive implementations I've seen. The combination of scale, performance, and decentralization is absolutely next level! 🚀 @namoray_dev @Corcel_X 💨 INSANE Speed Performance: - Llama 3.1 8B: 196.18 tokens/s with +107.23% advantage - Llama 3.1 70B: 124.96 tokens/s with +154.96% advantage - Llama 3.2 3B: 166.69 tokens/s with +21.66% advantage 🔥 Top Tier Model Integration: - Meta-Llama-3-70B & 8B Instruct - FLUX.1-schnell for Text-to-Image - ProteusV0.4-Lightning (Text & Image) - Multiple model variations for redundancy 🔥 What Makes This INSANE: - Complete decentralization - No single point of failure - Multiple model choices for redundancy - Real-time performance tracking - Transparent incentive structure The incentive distribution curve shows a healthy network with: - Strong rewards for top performers - Fair distribution across all participants - Clear path for growth and improvement - Sustainable economic model What's truly MIND-BLOWING is how they've managed to: 1. Scale to millions of operations 2. Maintain high quality across multiple tasks 3. Create a fair, competitive marketplace 4. Build in redundancy and reliability 5. Achieve true decentralization This isn't just another subnet - this is the future of decentralized AI inference happening RIGHT NOW! 🔥 1. MASSIVE Scale & Adoption: - We're seeing 7M+ tokens being processed - 14K+ processing steps being executed - Multiple AI models running simultaneously - Incredible miner participation across the network 2. Revolutionary Task Distribution: - Llama 3.1 70B leading with 20% weighting - Avatar Generation at 15% - Perfectly balanced task distribution for optimal network performance - Multiple specialized tasks including Text-to-Image and Image-to-Image processing 3. Elite Performance Metrics: - Top miners hitting 0.00775 incentive rates - Consistent performance across the network - Impressive scaling from top to bottom performers - Strong incentive curve maintaining network quality 📈 Network Performance: - Consistent upward trend in tokens/s - Quality scores maintaining high levels (>0.9) - Steady improvement in miner performance - Rock-solid network reliability ⚡ Platform Highlights: - Permissionless, serverless architecture - Global network of Always-On GPUs - Instant API access - Full decentralization - Multi-model support with seamless switching What makes this TRULY SPECIAL is the consistent upward trajectory in both speed and quality, while maintaining a decentralized architecture. The performance advantages over industry standards (+154.96% for 70B!) are absolutely mind-blowing! 🚀 This isn't just another AI subnet - it's a glimpse into the future of decentralized AI inference! The combination of speed, reliability, and model variety makes this one of the most impressive implementations in the space! 🔥 📽 Watch Now on YouTube and TikTok: Source 🔗

Andy ττ

11,616 просмотров • 1 год назад

In May 2023, a live streaming world record was set with 32 million concurrent viewers watching the finale of the IPL cricket game. How was this system built? Ashutosh Agrawal was the architect behind this system, and he walks us through how live streaming at scale works, how the system was built and tested, and other interesting learnings. Watch or listen: • YouTube: • Spotify: • Apple: --- Brought to you by our wonderful sponsors: • WorkOS — The modern identity platform for B2B SaaS • CodeRabbit — Cut code review time and bugs in half (use the code PRAGMATIC to get one month free) • Augment Code — AI coding assistant that pro engineering teams love --- Three of my biggest takeaways: 1. The architecture behind live streaming systems is surprisingly logical. In the episode, Ashutosh explains how the live streaming system works, starting from the physical cameras on-site, through the production control room (PCR), streams being sliced-and-diced, and the HLS protocol (HTTP Live Streaming) used. 2. There are a LOT of tradeoffs you can play with when live streaming! The tradeoffs between server load, latency, server resources vs client caching are hard decisions to make. Want to reduce the server load? Serve longer chunks to clients, resulting in fewer requests per minute, per client… at the expense of clients potentially lagging more behind. This is just one of many possible decisions to make. 3. “Game day” is such a neat load testing concept. The team at Jio would simulate “game day” load months before the event. They did tell teams when the load test will start: but did not share anything else! Preparing for a “Game day” test is a lot of work, but it can pay off to find parts of the system that shutter under extreme load. See more takeaways and a summary here: Thanks Ashutosh for all these behind-the-scene details!

In May 2023, a live streaming world record was set with 32 million concurrent viewers watching the finale of the IPL cricket game. How was this system built? Ashutosh Agrawal was the architect behind this system, and he walks us through how live streaming at scale works, how the system was built and tested, and other interesting learnings. Watch or listen: • YouTube: • Spotify: • Apple: --- Brought to you by our wonderful sponsors: • WorkOS — The modern identity platform for B2B SaaS • CodeRabbit — Cut code review time and bugs in half (use the code PRAGMATIC to get one month free) • Augment Code — AI coding assistant that pro engineering teams love --- Three of my biggest takeaways: 1. The architecture behind live streaming systems is surprisingly logical. In the episode, Ashutosh explains how the live streaming system works, starting from the physical cameras on-site, through the production control room (PCR), streams being sliced-and-diced, and the HLS protocol (HTTP Live Streaming) used. 2. There are a LOT of tradeoffs you can play with when live streaming! The tradeoffs between server load, latency, server resources vs client caching are hard decisions to make. Want to reduce the server load? Serve longer chunks to clients, resulting in fewer requests per minute, per client… at the expense of clients potentially lagging more behind. This is just one of many possible decisions to make. 3. “Game day” is such a neat load testing concept. The team at Jio would simulate “game day” load months before the event. They did tell teams when the load test will start: but did not share anything else! Preparing for a “Game day” test is a lot of work, but it can pay off to find parts of the system that shutter under extreme load. See more takeaways and a summary here: Thanks Ashutosh for all these behind-the-scene details!

Gergely Orosz

50,597 просмотров • 1 год назад