0xSero

@0xSero • 58,033 subscribers

I want to be free

Shorts

135,833 views

11,809 views

22,535 views

22,762 views

24,131 views

Videos

155,785 views • 10 days ago

38,993 views • 12 days ago

235,591 views • 5 months ago

43,386 views • 1 month ago

11,221 views • 7 days ago

131,800 views • 3 months ago

24,290 views • 18 days ago

31,883 views • 24 days ago

73,323 views • 2 months ago

110,349 views • 4 months ago

74,772 views • 3 months ago

23,422 views • 24 days ago

87,959 views • 3 months ago

32,537 views • 1 month ago

72,465 views • 3 months ago

65,255 views • 2 months ago

27,083 views • 1 month ago

109,943 views • 5 months ago

35,181 views • 1 month ago

51,035 views • 2 months ago

Shorts

Codex nooooooo I can't close this popup

What 4 concurrency on GLM sounds like. So gosh damn loud, open air frames are easier to build but so gosh darn loud

GLM-5.2-REAP-NVFP4. For this experiment I calibrated the model on 30,000+ samples of my agent sessions, tweets, writing, and codebases. Pretty charming; every dataset produced something different.

The fact that a 284B-parameter model is able to do all this in a single 400K-token session:
- ssh into my homelab to find docs
- ssh into Lambda / Prime Intellect
- rewrite REAP to support the DS4 attention
- download models & datasets
- run tests to check it works

Asking Codex (GPU) to find and kill Codex (Cerebras)

Videos

GLM-5.2 Nvidia NVFP4. No pruning, no quantising, exact same model card. 110 tok/s single stream. My friend is a genius, holy fudge. Here's the repo: he got full DS4-Flash on a 5090 + DDR5 at 38.5 tok/s. This is a breakthrough.

GLM-5.2 on 4x 6000s in ZCode. /goal build a Nintendo DS emulator in C++, do not use a ready-made emulator. Don't stop until you can run the Mario 64 DS ROM.

Kimi IS COOKING. Holy mackerel, this is way better than anything I can get out of Opus or GPT. Has some bugs, but looks soooo unique and fits well into my brand. For a one-shot I can't complain.

Exo. No more worrying about it. Best model, best config, best harness for your hardware, based on local . ai benchmarks. Just: exo run

. is continuing to grow and improve, it's gotten to a place where I'm very proud of it. Now to release it.

I promise this is the last pleometric video I drop on you.

Here's my conversation with Philip Kiely, a deep dive into inference engineering. His book is available online for free, and it's worth reading. Inference is becoming more important to people and organisations by the day; learning about the topic will prepare you well.

It's not a dream anymore, this is frontier at home. My next mission is to make this cheaper to run. Do you see the length of the session? Do you see the successful tool calls?

If you're a frontend dev, let me make your life easier:
1. Pay for Mobbin.
2. Set up the Mobbin MCP.
3. The model will be able to pull all designs off the screens.
4. Basically tons of free Figma designs.

Holy moly, MiniMax-M2.7 is amazing, watch till the end.

GLM-5.1-478B-NVFP4
Running on:
- 4x RTX Pro 6000
- SGLang
- 370,000 max tokens (1.75x full context)
- p10 27.7 | p90 45.6 tok/s decode (gen)
- 1340 tok/s prefill
I could get 2x decode if I limit to 64k context (100 tok/s).
In this video it operates Figma (:

GLM-5.2 running under my desk, modifying GGUFs to enable MTP on the Spark. What a world we live in, where you can have a megamind running off electrons going really fast.

Local AI can be so good, but you'd need about 12k USD to get it. Then it's not so great.
Here's a Q4 of Qwen3.5-262B-REAP:
- Weights 131GB
- KVCache 50GB
- 256,000 context
- 350 tokens/s prefill warmup
- 4,000 tokens/s prefill cache
- 36 tokens/s generation
- Vision enabled
REAP is good
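
A rough sanity check on the memory numbers in the post above, as a minimal sketch. It assumes the "262B" in the model name is the parameter count, that Q4 means exactly 4 bits per weight (real Q4 formats usually carry some extra scale overhead), and that 1 GB = 10^9 bytes. The variable and function names are made up for illustration; only the 131GB, 50GB, and 256,000 figures come from the post.

```python
GB = 1e9  # assumption: decimal gigabytes


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Bytes needed to store `params` weights at `bits_per_weight` bits each."""
    return params * bits_per_weight / 8


params = 262e9          # assumed from the model name (262B)
bits_per_weight = 4     # assumption: plain 4-bit, no scale/zero-point overhead

weights_gb = weight_bytes(params, bits_per_weight) / GB
kv_cache_gb = 50        # figure quoted in the post
context_tokens = 256_000  # figure quoted in the post

total_gb = weights_gb + kv_cache_gb
kv_bytes_per_token = kv_cache_gb * GB / context_tokens

print(f"weights: {weights_gb:.0f} GB")
print(f"weights + KV cache: {total_gb:.0f} GB")
print(f"KV cache per token: {kv_bytes_per_token / 1e3:.1f} kB")
```

Under those assumptions the weights come out at the 131GB quoted in the post, and weights plus KV cache need about 181GB before any activation or runtime overhead.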

MiniMax-M3 running locally at 100+ tok/s. Good questions; it's not often that models ask good questions.

Finally, full precision MiniMax-M2.7 running at home. 100 tokens/s decode, 5050 tokens/s prefill.

OAI has to be prioritising low/no thinking GPT-5.5 inference, it's so fast. I think this is better than high thinking, so much less annoying, so much more token efficient.

Guess who got GLM-5.2 at home? You can't deny I'm elite at inference.

I yapped about LLM compression for 40 minutes. How much misinformation did I spread this time (,:

I LOVE Deepseek-v4-flash: incredibly reliable, capable, and logical. It's lacking in frontend, but I have MiMo for that. I would recommend that any company spending 100k+ a year on AI purchase ~8-10 6000s and have a few of its workers blind test these models for work.

The fact that a 284B-parameter model is able to do all this in a single 400K-token session:
- ssh into my homelab to find docs
- ssh into Lambda / Prime Intellect
- rewrite REAP to support the DS4 attention
- download models & datasets
- run tests to check it works