Inference on UOMI is now free. Start building with frontier open-source models at no cost:
• MiniMax M2.7
• Qwen 3.6 27B
• Qwen 3.6 35B A3B
• Google Gemma 4 26B A4B
• Google Gemma 4 31B
Powered by the UOMI Inference Network. Start inferencing:

Uomi

97,585 subscribers

12,479 views • 1 month ago • via X (Twitter)

Related videos

/1 Gemma 4 31B just crushed Qwen 3.6 27B in a local LLM gamedev contest inside atomic.chat (prompt is below)

Device: MacBook Pro M5 Max, 64GB RAM

Results:
Qwen 3.6 27B: 32 tokens/sec · 18m 04s · 33,946 tokens
Gemma 4 31B: 27 tokens/sec · 3m 51s · 6,209 tokens

So what is more important: tokens per second, or the quality of the final answer? Qwen made a very long response and showed more creativity and visual style. But Gemma gave a shorter, clearer, and more logical answer in much less time.

In this one-shot Pac-Man gamedev contest, Gemma 4 31B was the clear winner. Its game logic was stronger: click reactions were smoother, and it handled interactions with elements like walls, ghosts, and particle effects better.

But this was only one test. Maybe Qwen 3.6 27B can show better results with better settings. Open the comments, try our prompt, and share your result below.

Chubby♨️

72,477 views • 2 months ago
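The trade-off in this post is simple arithmetic: wall-clock time is roughly tokens generated divided by decode throughput, so a slower model that writes far fewer tokens can still finish first. The sketch below only replays the numbers quoted in the post; the function and variable names are illustrative, and the estimate ignores prompt processing and model load time.

```python
# Replays the figures quoted in the post; nothing here is measured.

def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'Xm Ys' duration into seconds."""
    return minutes * 60 + seconds

def estimated_seconds(tokens: int, tok_per_s: float) -> float:
    """Decode-only estimate: tokens generated / decode throughput."""
    return tokens / tok_per_s

runs = {
    "Qwen 3.6 27B": {"tok_per_s": 32, "tokens": 33_946, "wall_s": to_seconds(18, 4)},
    "Gemma 4 31B": {"tok_per_s": 27, "tokens": 6_209, "wall_s": to_seconds(3, 51)},
}

for name, run in runs.items():
    est = estimated_seconds(run["tokens"], run["tok_per_s"])
    print(f"{name}: estimated {est:.0f}s, reported {run['wall_s']}s")

# How many times more tokens Qwen produced, and how much sooner Gemma finished.
token_ratio = runs["Qwen 3.6 27B"]["tokens"] / runs["Gemma 4 31B"]["tokens"]
time_saved_s = runs["Qwen 3.6 27B"]["wall_s"] - runs["Gemma 4 31B"]["wall_s"]
print(f"token ratio: {token_ratio:.2f}x, time saved: {time_saved_s}s")
```

The estimates land within a few percent of the reported times, which suggests that in this run the gap came mostly from response length rather than raw speed.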

GEMMA 4 FINISHED 14 MINUTES FASTER THAN QWEN 3.6 DESPITE LOWER TOKENS PER SECOND BY USING 5X FEWER TOKENS

0xMarioNawfal

71,957 views • 2 months ago

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!

local llm space is moving at terminal velocity. only 3 days ago google released gemma 4 26b a4b qat quants. more efficient than before, ran on 8gb vram at 20 tok/sec. and now just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. decode throughput went 25-40% up on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps!

llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. that's a 25-30% increase with the same hardware.

Here is the architectural catch you need to know: Unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies).

copy and try the exact flags:

-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v

n-max 4 and p-min 0.7 is also worth checking out. benchmark on your setup and workflow.

if you have a single 8 gb vram nvidia rtx 4060, 3060, 3070, 2080, 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger gpu, such as a single rtx 3090, 4090, 3060, 2060. MTP works for all gemma 4 sizes such as gemma 4 12b, gemma 4 31b etc. but remember to grab the correct mtp draft assistant models respectively.

what are you benchmarking today

Alok

200,913 views • 25 days ago
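For anyone who wants to sweep the settings mentioned in the post, here is a minimal sketch that assembles the argument list and checks the speedup arithmetic. The flags and file names are copied verbatim from the post and have not been verified against any llama.cpp build; the binary name `llama-cli` is an assumption, since the post does not say which executable was used, and the helper names are illustrative.

```python
import shlex

def speedup_percent(before_tps: float, after_tps: float) -> float:
    """Relative throughput gain in percent."""
    return (after_tps / before_tps - 1.0) * 100.0

def build_args(n_max: int = 6, p_min: float = 0.7) -> list[str]:
    """Argument list as quoted in the post (unverified flags)."""
    return [
        "-m", "gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf",
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_max),
        "--spec-draft-p-min", str(p_min),
        "--spec-draft-model", "gemma-4-26b-A4B-it-assistant-Q4_0.gguf",
        "-c", "64000",
        "-v",
    ]

BINARY = "llama-cli"  # assumption: the post does not name the executable

# The two settings the post suggests trying.
for n_max in (6, 4):
    print(shlex.join([BINARY, *build_args(n_max=n_max)]))

# Gains relative to the 20 tok/s baseline quoted in the post.
for after in (25.0, 27.0, 28.0):
    print(f"20 -> {after:.0f} tok/s: +{speedup_percent(20.0, after):.0f}%")
```

Against the quoted 20 tok/s baseline, 25 to 28 tok/s works out to roughly +25% to +40%, which matches the range in the post's opening paragraph.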

Qwen 3.6 is frontier for local. It also thinks forever. I tried a dumb inference-time trick: make its <think> block obey a tiny grammar. Result:
- HumanEval+: 22x fewer think tokens, no accuracy loss
- LiveCodeBench public slice: +14% pass@1, ~5x fewer total tokens

andthattoo

282,081 views • 2 months ago
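The post does not say what the grammar was, so the following is a purely hypothetical illustration of what "a tiny grammar" for a reasoning block could look like: at most three short bullet lines between `<think>` tags. It is only a validator written as a regular expression. Real grammar-constrained decoding enforces such a rule while sampling, which this sketch does not attempt; the tag name, the line limit, and the 80-character cap are all assumptions.

```python
import re

# Hypothetical grammar: 1 to 3 bullet lines, each at most 80 characters.
STEP = r"- [^\n]{1,80}\n"
THINK_GRAMMAR = re.compile(rf"<think>\n(?:{STEP}){{1,3}}</think>\n")

def conforms(block: str) -> bool:
    """True if the whole block matches the hypothetical grammar."""
    return THINK_GRAMMAR.fullmatch(block) is not None

terse = "<think>\n- parse the input\n- sum the even values\n</think>\n"
rambling = "<think>\nWait, let me reconsider the problem again...\n</think>\n"
too_long = "<think>\n" + "- another step\n" * 10 + "</think>\n"

for label, block in [("terse", terse), ("rambling", rambling), ("too_long", too_long)]:
    print(label, conforms(block))
```

A rule of this shape caps the reasoning length by construction, which is one plausible way a grammar could cut think tokens as sharply as the post reports.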

Claude fable 5 is cooked. Claude fable 5 vs Gemma 4 26b a4b qat MTP: Gemma 4 26b running locally on my 8GB VRAM single RTX 4060 built this using three.js in a single session and 3 prompts. no cloud, no subscription, 100% private, unlimited use. how long until it can catch up completely?

Alok

30,782 views • 23 days ago

Staking $OVR is Now LIVE – Earn Daily Rewards in $UOMI The $OVR staking program is live on Base! Earn daily $UOMI rewards while supporting a revolutionary blockchain ecosystem. Flexible, rewarding, and with no lock-up periods—start staking today! 🧵 👉 Start staking your $OVR here:

Over the Reality 🌐

222,114 views • 1 year ago

Dolphin Inference Network node operation is now live for anyone who would like to beta test before we go into production. $POD rewards are live for testers. Repurposing idle GPUs to run Qwen 3.5 35B MoE.

Dolphin

76,340 views • 2 months ago

Qwen3.5-35B-A3B is now in Jan 🔥 It surpasses previous Qwen3 models more than 6× its size. Get the latest Jan at Thanks to Qwen for the base model and Georgi Gerganov for llama.cpp 💛

👋 Jan

34,433 views • 4 months ago

Llama 2: Now on Hugging Chat 🤗🦙 Try out the 70B Chat model for free with super fast inference, web search, and powered by open-source tools! 👉

Hugging Face

403,558 views • 2 years ago

Nemotron 3 Ultra is fast and genuinely good. Compared it with 3 frontier models (DeepSeek V4, MiniMax M3, and Qwen 3.7 Max) on 2 prompts. Very impressive results.

GMI Cloud

225,296 views • 28 days ago

You can now run inference directly on the Llama 4 Hugging Face model page – powered by Together AI!

Together AI

21,489 views • 1 year ago

Now that we have amazing open source TTS with fast inference, what are you building?

Victor M

431,129 views • 1 year ago

Powered by the best models out, Start & End frames go crazy on Higgsfield. Introducing Start & End frames powered by Hailuo AI (MiniMax) on Higgsfield. Build your transition fully from scratch. Retweet to get the INSIDER guide on MiniMax Start & End frames in DMs.

Higgsfield AI 🧩

1,285,244 views • 10 months ago

Perplexity Pro is Now Powered by Cerebras. Perplexity Sonar, now running on Cerebras Inference, delivers answers at an unprecedented 1,200 tokens/s – 10x faster than comparable models.

Cerebras

68,007 views • 1 year ago

Announcing Fortytwo’s Swarm Inference A decentralized AI architecture that outperforms the top frontier models from the biggest labs: > ChatGPT 5 (OpenAI), > Gemini 2.5 Pro (Google), > Claude Opus 4.1 (Anthropic), > Grok 4 (xAI), > DeepSeek R1 (DeepSeek). Thread ↓

Fortytwo

171,398 views • 8 months ago

my 8 GB VRAM gaming laptop is absolutely going to hate me for this. but I still did it. ran a 31b dense model (Gemma 4 31b Q4) with only 8 GB VRAM.

last week I ran Gemma 4 26B A4B, a mixture of experts model, on my RTX 4060 and hit 25–28 tokens/sec using llama.cpp's new MTP support. smooth. snappy. but MoE has a secret: it only activates 4B parameters per token despite having 26B total. that's why it flies.

so the real question started haunting me. what if I throw a full, no tricks, every parameter fires on every token, 31B DENSE model at the same machine?

# Hardware:
GPU: NVIDIA RTX 4060, 8 GB VRAM
RAM: 16 GB
CPU: Intel Core i7 H
Laptop. Gaming. Modest.

The model: gemma-4-31B-it-qat-UD-Q4_K_XL.gguf (model's unsloth huggingface link in the comments)

This is Google DeepMind's flagship dense model in the Gemma 4 family that can run on a single consumer GPU. It packs a hybrid attention architecture, supports up to 256K context natively, and is QAT (Quantization Aware Training) optimized, meaning it retains far more quality than standard post training quants at the same bit depth. This is NOT the MoE. This is 31 BILLION dense parameters, every single one of them loaded.

# the flags I used:
-m gemma-4-31B-it-qat-UD-Q4_K_XL.gguf -cnv --spec-type draft-mtp --spec-draft-model mtp-gemma-4-31B-it.gguf --spec-draft-n-max 8 --spec-draft-p-min 0.6 -c 6000 -v

Multi Token Prediction (MTP) is still active here. Separate draft GGUF required, same as the 26B setup.

# Results:
→ Decode: ~3 tokens/sec
→ Prefill: ~2 tokens/sec
→ Context: 6000 tokens
→ Hardware crying quietly in the corner: yes

so is 3 tps actually usable? For real time back and forth chat? Not ideal. You're not having a fluid conversation at 3 tps. but slow ≠ useless. And this is where it gets genuinely interesting.

think about how senior devs actually work in a real team. the junior devs handle the routine tickets. But when something is architectural, deeply complex, or needs serious reasoning? they walk down the hall and escalate to the senior.

That's exactly the local AI agent architecture this unlocks:
→ Fast orchestrator model (Gemma 4 26B MoE at 25+ tps) handles routing, simple queries, tool calls, memory. The junior dev.
→ Gemma 4 31B dense is the senior, called only when the fast model genuinely hits a wall. Hard multi step reasoning. Complex code generation. Deep architectural decisions.

The agentic loop stays fast. Only the hard hops touch the 31B. That's a legitimate production grade local AI architecture on budget hardware. (requires two 8 GB GPUs)

other workflows where 3 tps is completely fine:
- overnight batch jobs. summarize documents, extract structured data, review code. Fire it off. Sleep. wake up to results.
- One shot deep reasoning
- Silent code audit loops: you write and test, the 31B reviews diffs and flags issues in the background between your sprints
- Any workflow where output quality > output speed

A few weeks ago, nobody was running a 30B+ dense model on a single consumer GPU with 8 GB VRAM. At all. Now we're doing it on an Intel i7-H gaming laptop with an NVIDIA RTX 4060, thanks to llama.cpp + QAT quants + MTP speculative drafting.

Google DeepMind said the Gemma 4 31B targets "consumer GPUs and workstations." They were not exaggerating. The hardware bar to run serious frontier class models locally keeps dropping. the tools are here. the models are here. you just have to be willing to abuse your laptop a little.

what workflows would you actually run on a local 3 tps 31B dense model? genuinely curious. drop it below.

Alok

63,095 views • 16 days ago
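The junior/senior split described in this post can be sketched as a small router: a cheap check decides whether a prompt goes to the fast model or is escalated to the slow one. Everything below is hypothetical scaffolding. The `Model` class, the keyword heuristic, and the stand-in `generate` lambdas are assumptions, not the author's setup; only the throughput figures (25 and 3 tokens/sec) come from the post.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Model:
    name: str
    tokens_per_second: float          # throughput figure quoted in the post
    generate: Callable[[str], str]    # stand-in for a real inference call

def keyword_is_hard(prompt: str) -> bool:
    """Hypothetical escalation heuristic; a real system would need something better."""
    hard_markers = ("architecture", "refactor", "prove", "multi-step")
    return any(marker in prompt.lower() for marker in hard_markers)

def route(prompt: str, fast: Model, slow: Model,
          is_hard: Callable[[str], bool] = keyword_is_hard) -> tuple[str, str]:
    """Send hard prompts to the slow model, everything else to the fast one."""
    chosen = slow if is_hard(prompt) else fast
    return chosen.name, chosen.generate(prompt)

def tokens_in(hours: float, tokens_per_second: float) -> int:
    """Token budget for an unattended batch run of the given length."""
    return int(hours * 3600 * tokens_per_second)

fast = Model("gemma-4-26B-A4B (MoE)", 25.0, lambda p: f"[fast] {p}")
slow = Model("gemma-4-31B (dense)", 3.0, lambda p: f"[slow] {p}")

print(route("What time zone is UTC+2?", fast, slow))
print(route("Review the architecture of this service", fast, slow))
print("8-hour overnight budget at 3 tok/s:", tokens_in(8, slow.tokens_per_second))
```

The same budget arithmetic explains the overnight-batch use case: at 3 tokens/sec, eight unattended hours yield 86,400 tokens of output.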

We’re excited to announce another official launchpad partner for the $UOMI Public Sale! This time, we’re thrilled to team up with Spores Network 📅 The $UOMI IDO on Spores goes live September 6th at 10 AM UTC, running until September 8th at 11 AM UTC. With Spores Network onboard, we continue expanding the journey to bring truly autonomous AI Agents on-chain. And more partners are coming soon. 👉 Register here: The countdown is on: $UOMI TGE goes live September 10th 🔥

Uomi

28,021 views • 10 months ago

Gemma 4 is here! Our most intelligent open models to date are built on the same world-class research and tech as Gemini 3, and are sized to run and fine-tune efficiently on local hardware. Check out what Google Gemma 4 brings to devs:
💎 Advanced Reasoning: Deep logic tasks, complex multi-step planning, and beyond
💎 Longer context: Seamlessly analyze entire codebases with context windows of 128K tokens for our edge models and 256K tokens for our largest models
💎 Vision and audio: Rich, multimodal interactions out of the box
💎 140+ languages: Trained on 140+ languages
💎 Apache 2.0 license: industry-standard open-source license

Google for Developers

269,578 views • 3 months ago

Gemma 4 analyzes the video. Generates key questions. Calls Falcon Perception. "Find all the people." 156 found. "Detect only white cars." 8 found. A 26B model is running agentic multi-QA vision orchestration. The models are running locally on a MacBook with MLX. No API.

Maziyar PANAHI

158,615 views • 2 months ago

Poe now supports OpenCode An open-source AI coding terminal that pairs with all major models on Poe. One click login, instant access, no extra configuration. Start building now.

Poe

44,138 views • 3 months ago