Загрузка видео...

Не удалось загрузить видео

Возникла проблема при загрузке этого видео. Это может быть связано с временными проблемами сети или видео может быть недоступно.

На главную

Gemma 4 is now on Cloudflare Workers AI. Vision, tool calling, reasoning and a 256k context window… Here’s a simple TanStack + Workers AI compliments app. 4 compliments because it’s Gemma 4.

Jilles Soeters

2,163 subscribers

66,683 просмотров • 2 месяцев назад •via X (Twitter)

Anya Rossi• Live Now

Private livecam show

Комментарии: 0

Нет доступных комментариев

Здесь появятся комментарии из оригинального поста

Похожие видео

Gemma 4 E4B (4-bit) completed a full repo audit by executing Bash code and tool calls locally. Runs on just 6GB RAM.

Gemma 4 E4B (4-bit) completed a full repo audit by executing Bash code and tool calls locally. Runs on just 6GB RAM.

Unsloth AI

149,137 просмотров • 3 месяцев назад

Gemma 4 is here! Our most intelligent open models to date, are built on the same world-class research and tech as Gemini 3, and are sized to run and fine-tune efficiently on local hardware. Check out what Google Gemma 4 brings to devs: 💎 Advanced Reasoning: Deep logic tasks, complex multi-step planning, and beyond 💎 Longer context: Seamlessly analyze entire codebases with context windows of 128K tokens for our edge models and 256K tokens for our largest models 💎 Vision and audio: Rich, multimodal interactions out of the box 💎 140+ languages: Trained on 140+ languages 💎 Apache 2.0 license: industry-standard open-source license

Gemma 4 is here! Our most intelligent open models to date, are built on the same world-class research and tech as Gemini 3, and are sized to run and fine-tune efficiently on local hardware. Check out what Google Gemma 4 brings to devs: 💎 Advanced Reasoning: Deep logic tasks, complex multi-step planning, and beyond 💎 Longer context: Seamlessly analyze entire codebases with context windows of 128K tokens for our edge models and 256K tokens for our largest models 💎 Vision and audio: Rich, multimodal interactions out of the box 💎 140+ languages: Trained on 140+ languages 💎 Apache 2.0 license: industry-standard open-source license

Google for Developers

269,578 просмотров • 3 месяцев назад

i just ran Google's brand new Unsloth Gemma4 12B dense GGUF on my RTX 4060 using llama.cpp + CUDA 13.2 21 tokens per second. on a budget consumer GPU. locally. no API. no cloud. no subscription. and the benchmarks are absolutely cooked # first let's talk architecture because this is genuinely different every multimodal model you've used has a frozen vision encoder + frozen audio encoder + LLM backbone glued together Gemma 4 12B is different it's a single decoder only transformer. that's it. vision? raw 48×48 pixel patches → one matmul → projected directly into the LLM audio? raw 16kHz signal sliced into 40ms frames → linear projection → same LLM input space no encoder tax. no latency penalty. no fragmented memory to put the encoder savings in perspective: old Gemma 4 26B approach: - 550M param vision encoder (frozen) - 300M param audio encoder (frozen) - LLM backbone Gemma 4 12B: - 35M param vision embedder (a single matmul) - no audio encoder at all - LLM backbone handles EVERYTHING 550M → 35M for vision alone. that's a 15x reduction this is why the gemma-4-12b-it-Q4_K_M.gguf is just 6.6 GBs!!! and it has 256K native context context # Benchmarks: AIME 2026 (math olympiad): 77.5% GPQA Diamond (expert science): 78.8% LiveCodeBench v6 (real code): 72% Codeforces ELO: 1659 MMLU Pro: 77.2% MATH-Vision: 79.7% BigBench Extra Hard: 53% inference → llama.cpp, LM Studio, vLLM, SGLang llamacpp flags: -m "gemma-4-12b-it-Q4_K_M.gguf" -ngl 99 -c 8000 -v --port 8080 Available on huggingface now! Link below

Alok

277,107 просмотров • 1 месяц назад

Btw this is all it takes to deploy tanstack start to cloudflare workers ✨ this gives you rate limiting, queues, email workers and a bunch of other cool features

Btw this is all it takes to deploy tanstack start to cloudflare workers ✨ this gives you rate limiting, queues, email workers and a bunch of other cool features

Dev Ed

16,052 просмотров • 1 год назад

GROK 4 FAST 2M CONTEXT – NOW IN YOUR POCKET A reasoning model with a 2M context window — rewriting the cost curve of intelligence! Here’s the Hit List: •⁠ ⁠Free on iOS and Android •⁠ ⁠Multimodal brain built for more than text •⁠ ⁠2M context leaves competitors wheezing Source: xAI

GROK 4 FAST 2M CONTEXT – NOW IN YOUR POCKET A reasoning model with a 2M context window — rewriting the cost curve of intelligence! Here’s the Hit List: •⁠ ⁠Free on iOS and Android •⁠ ⁠Multimodal brain built for more than text •⁠ ⁠2M context leaves competitors wheezing Source: xAI

Mario Nawfal

84,937 просмотров • 9 месяцев назад

Inference on UOMI is now free. Start building with frontier open-source models at no cost: • MiniMax M2.7 • Qwen 3.6 27B • Qwen 3.6 35B A3B • Google Gemma 4 26B A4B • Google Gemma 4 31B Powered by the UOMI Inference Network. Start inferencing:

Inference on UOMI is now free. Start building with frontier open-source models at no cost: • MiniMax M2.7 • Qwen 3.6 27B • Qwen 3.6 35B A3B • Google Gemma 4 26B A4B • Google Gemma 4 31B Powered by the UOMI Inference Network. Start inferencing:

Uomi

12,479 просмотров • 1 месяц назад

Google’s Gemma 4 E2B running on-device on iPhone 17 Pro Gemma 4 is built from the same research as Gemini 3, has image understanding capabilities and can reason if needed Running at ~40tk/s with MLX optimized for Apple Silicon

Google’s Gemma 4 E2B running on-device on iPhone 17 Pro Gemma 4 is built from the same research as Gemini 3, has image understanding capabilities and can reason if needed Running at ~40tk/s with MLX optimized for Apple Silicon

Adrien Grondin

1,043,808 просмотров • 3 месяцев назад

Running Gemma 4 12B on your iPhone? Yes! 🧵 LM Studio + Locally AI latest version with LM Link is really cool! This opens up additional scenarios! My brain is on fire 🔥

Running Gemma 4 12B on your iPhone? Yes! 🧵 LM Studio + Locally AI latest version with LM Link is really cool! This opens up additional scenarios! My brain is on fire 🔥

Ivan Fioravanti ᯅ

19,292 просмотров • 29 дней назад

You don't need to use this stack, but... **Hono + AI SDK + Cloudflare Workers** is Great!

You don't need to use this stack, but... Hono + AI SDK + Cloudflare Workers is Great!

Yusuke Wada

19,367 просмотров • 1 год назад

No Next.js. No React. No TypeScript. just a simple html and js file. deployed on Cloudflare workers. feels… enough.

No Next.js. No React. No TypeScript. just a simple html and js file. deployed on Cloudflare workers. feels… enough.

Aykut

135,387 просмотров • 3 месяцев назад

merjs now runs as a native macOS desktop app(exp). no Electron. no Tauri. no Node.js. one Zig binary. 5.3MB. launches instantly. the same framework that runs on Vercel 's Edge Runtime and Cloudflare workers now runs inside a native AppKit window.

merjs now runs as a native macOS desktop app(exp). no Electron. no Tauri. no Node.js. one Zig binary. 5.3MB. launches instantly. the same framework that runs on Vercel 's Edge Runtime and Cloudflare workers now runs inside a native AppKit window.

Rach

41,123 просмотров • 3 месяцев назад

“AI UGC doesn’t work, it’s a scam.” 1.9M views in 4 days. $30k MRR app.

“AI UGC doesn’t work, it’s a scam.” 1.9M views in 4 days. $30k MRR app.

Simone Canc

80,024 просмотров • 1 месяц назад

Using Gemma 4 E2B for audio transcription on my Pixel 10 Pro. It support max 30 sec for now.

Using Gemma 4 E2B for audio transcription on my Pixel 10 Pro. It support max 30 sec for now.

AshutoshShrivastava

88,935 просмотров • 2 месяцев назад

Gemma 4 looks at a parking lot. Decides what to ask. Calls SAM 3.1. "Segment all vehicles." 64 found. "Now just the white ones." 23 found. One model reasoning and orchestrating. One model executing. Both running locally on a MacBook. MLX. No cloud. No API.

Gemma 4 looks at a parking lot. Decides what to ask. Calls SAM 3.1. "Segment all vehicles." 64 found. "Now just the white ones." 23 found. One model reasoning and orchestrating. One model executing. Both running locally on a MacBook. MLX. No cloud. No API.

Maziyar PANAHI

593,540 просмотров • 2 месяцев назад

🚨Breaking: Tencent Hunyuan just dropped Hunyuan-A13B first open-source hybrid reasoning model, which supports switching between fast and slow thinking modes. - 256K context window - Advanced agentic tool calling capabilities Did a quick test with a front-end question it performed well. Overall, a strong model given its size. More details and how to try👇

🚨Breaking: Tencent Hunyuan just dropped Hunyuan-A13B first open-source hybrid reasoning model, which supports switching between fast and slow thinking modes. - 256K context window - Advanced agentic tool calling capabilities Did a quick test with a front-end question it performed well. Overall, a strong model given its size. More details and how to try👇

AshutoshShrivastava

13,672 просмотров • 1 год назад

Claude fable 5 is cooked Claude fable 5 Vs Gemma 4 26b a4b qat MTP Gemma 4 26b running locally on my 8GB vram single RTX 4060 built this using three.js in a single session and 3 prompts. no cloud, no subscription 100% private unlimited use. how long until it can catch up completely?

Claude fable 5 is cooked Claude fable 5 Vs Gemma 4 26b a4b qat MTP Gemma 4 26b running locally on my 8GB vram single RTX 4060 built this using three.js in a single session and 3 prompts. no cloud, no subscription 100% private unlimited use. how long until it can catch up completely?

Alok

30,782 просмотров • 24 дней назад

iPhone 17 ProでGoogleの最新AIモデルGemma 4動かしてみたら爆速だった🫪

iPhone 17 ProでGoogleの最新AIモデルGemma 4動かしてみたら爆速だった🫪

まみよし

64,192 просмотров • 3 месяцев назад

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included! local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps! llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec thats 25-30% increase with the same hardware. Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies). copy and try the exact flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow. if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have asmaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively. what are you benchmarking today

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included! local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps! llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec thats 25-30% increase with the same hardware. Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies). copy and try the exact flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow. if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have asmaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively. what are you benchmarking today

Alok

200,913 просмотров • 25 дней назад

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM last week I ran Gemma 4 26B A4B a mixture of experts model on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies. so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine? # Hardware: GPU: NVIDIA RTX 4060, 8 GB VRAM RAM: 16 GB CPU: Intel Core i7 H Laptop. Gaming. Modest. The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments) This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded. # the flags I used: -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup. # Results: → Decode: ~3 tokens/sec → Prefill: ~2 tokens/sec → Context: 6000 tokens → Hardware crying quietly in the corner: yes so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting. think about how senior devs actually work in a real team. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior. That's exactly the local AI agent architecture this unlocks: → Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev. → Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions. The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on a budget hardware. (requires 2 8gb gpus) other workflows where 3 tps is completely fine: - overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results. - One shot deep reasoning - Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints - Any workflow where output quality > output speed A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with a NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting. Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little. what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM last week I ran Gemma 4 26B A4B a mixture of experts model on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies. so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine? # Hardware: GPU: NVIDIA RTX 4060, 8 GB VRAM RAM: 16 GB CPU: Intel Core i7 H Laptop. Gaming. Modest. The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments) This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded. # the flags I used: -m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup. # Results: → Decode: ~3 tokens/sec → Prefill: ~2 tokens/sec → Context: 6000 tokens → Hardware crying quietly in the corner: yes so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting. think about how senior devs actually work in a real team. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior. That's exactly the local AI agent architecture this unlocks: → Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev. → Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions. The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on a budget hardware. (requires 2 8gb gpus) other workflows where 3 tps is completely fine: - overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results. - One shot deep reasoning - Silent code audit loops, you write and test, the 31B reviews diffs and flags issues in the background between your sprints - Any workflow where output quality > output speed A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with a NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting. Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little. what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

Alok

63,095 просмотров • 17 дней назад

Gemma 4 sees a kid and three dogs. Decides what matters. Calls SAM 3.1 Mask and bounding box. Spotlight on subjects. Background blur. Background pixelation. Four effects. Fully agentic. Two models talking to each other on a MacBook. No App. No cloud. What would you edit?

Gemma 4 sees a kid and three dogs. Decides what matters. Calls SAM 3.1 Mask and bounding box. Spotlight on subjects. Background blur. Background pixelation. Four effects. Fully agentic. Two models talking to each other on a MacBook. No App. No cloud. What would you edit?

Maziyar PANAHI

30,513 просмотров • 2 месяцев назад