Loading video...

Video Failed to Load

Go Home

Can a VLM see without a vision encoder? We trained one for $100, inspired by Gemma 4 12B. Latency on an M3 Pro MacBook: 112 ms -> 1.1 ms for the image path 30% lower end-to-end image+LLM The architecture is just: patchify the image -> linear projection with pos...

58,819 views • 5 days ago •via X (Twitter)

0 Comments

No comments available

Comments from the original post will appear here

Related Videos

i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2 21 tokens per second. on a budget consumer GPU. locally. no API. no cloud. no subscription. and the benchmarks are absolutely cooked # first let's talk architecture because this is genuinely different every multimodal model you've used has a frozen vision encoder + frozen audio encoder + LLM backbone glued together Gemma 4 12B is different it's a single decoder only transformer. that's it. vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space no encoder tax. no latency penalty. no fragmented memory to put the encoder savings in perspective: old Gemma 4 26B approach: - 550M param vision encoder (frozen) - 300M param audio encoder (frozen) - LLM backbone Gemma 4 12B: - 35M param vision embedder (a single matmul) - no audio encoder at all - LLM backbone handles EVERYTHING 550M → 35M for vision alone. that's a 15x reduction this is why the gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GBs!!! and it has 256K native context context # Benchmarks: AIME 2026 (math olympiad): 77.5% GPQA Diamond (expert science): 78.8% LiveCodeBench v6 (real code): 72% Codeforces ELO: 1659 MMLU Pro: 77.2% MATH-Vision: 79.7% BigBench Extra Hard: 53% inference → llama.cpp, LM Studio, vLLM, SGLang llamacpp flags: -m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080 Available on huggingface now! Link below

Alok

277,107 views • 20 days ago