THIS GUY PUT AN AI ON A RASPBERRY PI AND MADE IT QUESTION ITS OWN EXISTENCE FOREVER. He built a physical art installation called "latent reflection" where a language model runs on a $60 Raspberry Pi 4B with 4GB of RAM: no internet, no cloud, and it's completely isolated...

292,311 views • 2 months ago • via X (Twitter)

Related videos

This Chinese developer linked two $2,999 NVIDIA DGX Sparks into one box and runs the full Qwen3-235B at home, after dropping his $1,999-a-month cloud bill to zero.

He wired 2 small boxes into a single computer, split a giant 235-billion-parameter model in half between them, and serves it across his own network at about 10 tokens a second, with no internet, no cloud, right there on the desk. No data center, no thousand-dollar graphics cards, no monthly cloud bill. Just him, 2 gold boxes the size of a sandwich, one cable between them, and 1 power strip.

And here is the whole payoff. He used to pay the cloud $1,999 a month for the same model, and the meter ticked on every request. Now he paid $5,998 once for 2 boxes, they covered their cost in 3 months, and after that he sends as many requests as he wants for free, only electricity.

The two Sparks talk over one fast cable, each holds 128GB of memory, and together they carry the whole model, about 73GB loaded per box, with the chip inside pinned near the limit at 96%. Both boxes work as one and keep trading data over the cable, with no cloud in the loop and no single word leaking out. The ready model sits on one local address, and any app on his network calls it as easily as ChatGPT.

And here is how he described, in plain words, what this pair of boxes does:

"this is a pair of boxes that holds the huge Qwen3-235B model and serves it to one network. the model is split in half, and each box owns its half.
parts:
// Box 1 (holds the first half of the model and starts the answer fast, the first word appears in under a second)
// Box 2 (holds the second half and writes out the rest, about 10 tokens a second)
// Cable (connects the 2 boxes and moves data between them on every step, with no lag)
// Address (one local address where any app sends its request, like to a cloud model)
// Test (a script that runs big prompts through and measures speed and delays)
// Monitor (checks temperature, power draw, and load on both boxes every 2 seconds).
the model never goes to the cloud. he only steps in when a box runs hotter than 80 degrees or the cable between them starts dropping data."

So the system knows exactly what it is, what it is for, and where its limits are. It knows it has to hold the whole huge model across 2 boxes on its own. It knows it has to answer every request locally, with no meter, no limits, and no internet. It knows the human is only needed when a box overheats or the link between them stalls.

→ The setup runs around the clock on 2 boxes, each pulling under 60 watts
→ However many requests he sends, the monthly bill is $0, only electricity
→ The first box starts the answer in under a second
→ The second writes text at about 10 tokens a second
→ One request at a time: 838 tokens in 85 seconds, first word in 0.8s
→ Two requests at once: 697 tokens in 108 seconds, first word in 0.7s
→ Both boxes sit at 96% load and warm up to 76-78 degrees

And only when a chip in a box runs hotter than 80 degrees or the cable between the 2 Sparks drops data does the system call the owner. And when he himself is out on a run or in a coffee shop, he still reaches his own model at home from his phone: sends a big prompt to the local Qwen3-235B, gets the full answer back in under a minute and a half, with no token meter ticking and no limit to hit.

Here is what the test shows on his screen during one of the night runs:

"one request at a time: 838 tokens in 84.9 seconds, first word in 0.8s, then 0.1s per token."
"two requests at once: 697 tokens in 107.6 seconds, first word in 0.7s, then 0.15s per token."
"Box 1: chip at 96% load, 76 degrees, 56 watts, 73GB used in memory."
"Box 2: chip at 96% load, 78 degrees, 56 watts, the Qwen3-235B model fully loaded."

And while everyone around is paying for AI by the month and bumping into limits, his top-tier model just sits on the desk and works as much as he wants: his own little power plant instead of a forever meter. He has no server rack of his own and no cloud account behind it. Just 2 DGX Spark boxes on a desk, one model split in half between them, one local address, and a folder of prompts next to it.

Out of everything I have seen this year, this is the cleanest way to stop paying for AI: $5,998 of hardware on the desk once, $0 a month to the cloud, unlimited forever, and between them 2 gold boxes, 1 cable, and the full Qwen3-235B answering at home with no internet.

Blaze

93,219 views • 1 month ago
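
The payback and speed figures quoted in the post above can be checked with a few lines of arithmetic. This is a minimal sketch, not the author's test script: the function names are illustrative, electricity cost is ignored because the post gives no price for it, and the 697-token figure is used exactly as reported without assuming whether it is per request or a combined total.

```python
# Figures below are the ones quoted in the post; names are illustrative.
HARDWARE_COST_USD = 2 * 2999        # two DGX Spark boxes, bought once
CLOUD_BILL_USD_PER_MONTH = 1999     # the previous monthly cloud bill


def payback_months(hardware_cost, monthly_bill):
    """Months of avoided cloud spend needed to cover the hardware cost.

    Assumption: electricity is ignored, since the post gives no price for it.
    """
    return hardware_cost / monthly_bill


def tokens_per_second(tokens, total_seconds):
    """Average end-to-end rate, including the wait for the first token."""
    return tokens / total_seconds


def seconds_per_token_after_first(tokens, total_seconds, first_token_seconds):
    """Average gap between tokens once the first one has appeared."""
    return (total_seconds - first_token_seconds) / (tokens - 1)


payback = payback_months(HARDWARE_COST_USD, CLOUD_BILL_USD_PER_MONTH)
single_rate = tokens_per_second(838, 84.9)
dual_rate = tokens_per_second(697, 107.6)
single_gap = seconds_per_token_after_first(838, 84.9, 0.8)
dual_gap = seconds_per_token_after_first(697, 107.6, 0.7)

print(f"payback: {payback:.2f} months")
print(f"one request:  {single_rate:.2f} tok/s, {single_gap:.3f} s/token")
print(f"two requests: {dual_rate:.2f} tok/s, {dual_gap:.3f} s/token")
```

The per-token gaps come out close to the 0.1s and 0.15s shown in the quoted test output, which suggests the post's numbers are internally consistent.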

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis, AI equals memory (Save this). Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute, GPUs stay near 100% utilization crunching through billions of gradient updates. Inference is a completely different problem. When a model generates a response, it produces tokens one at a time and at every single step, the entire model has to be loaded from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive. This is what engineers mean when they say inference is memory bound, the bottleneck is not how many calculations you can do per second but rather how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem, it just gives you more processors starving for the same data. Modern LLMs use a KV cache, a data structure that stores the conversation's context so the model does not have to recompute it from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token and for long documents or deep reasoning chains, it can dwarf the model weights themselves in memory consumption. This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds and how cheaply you can run it. A memory constrained model is not just slower but rather qualitatively worse, it forgets earlier parts of the conversation, truncates context and hallucinates more because it literally cannot hold the relevant information long enough to use it. The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics, cost per token, latency, context length, concurrent users are memory problems first and compute problems second. 
The companies that control memory bandwidth and supply are not suppliers to the AI trade but rather are the AI trade. Long Micron! Follow me Melvin for more AI, semis and the next big market themes.

Melvin

47,148 views • 5 days ago
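
The two memory claims in the post above, that the KV cache grows with every token and that decode speed is capped by how fast data reaches the chip, can be put into rough numbers. This is a back-of-the-envelope sketch: the model dimensions and the bandwidth figure are hypothetical, chosen only to make the arithmetic easy to follow, and the decode bound ignores KV cache reads, batching, and compute time.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Size of the KV cache for a standard transformer decoder.

    The leading 2 counts one key tensor and one value tensor per layer.
    bytes_per_value=2 corresponds to 16-bit storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


def max_decode_tokens_per_second(bandwidth_bytes_per_s, bytes_read_per_token):
    """Rough upper bound on single-stream decode speed.

    Assumes every generated token requires streaming bytes_read_per_token
    from memory and that nothing else limits throughput.
    """
    return bandwidth_bytes_per_s / bytes_read_per_token


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
for context in (1_024, 8_192, 131_072):
    size_gib = kv_cache_bytes(32, 8, 128, seq_len=context) / 2**30
    print(f"{context:>7} tokens -> {size_gib:.3f} GiB of KV cache")

# Hypothetical hardware: 14 GB of weights read per token at 1 TB/s.
print(f"decode ceiling: {max_decode_tokens_per_second(1e12, 14e9):.1f} tok/s")
```

Because the cache size is linear in both sequence length and batch size, the same memory budget can be spent on longer contexts or on more concurrent users, which is the trade-off the post describes.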

The creator of High Bandwidth Memory (HBM) put a number on the AI build that should stop every infra investor cold. A cluster of a million GPUs runs at roughly 10-20% utilization (Save this). Kim Jung-ho spent thirty years building what feeds the GPU, and his claim is that the GPU is barely working. Here is what is actually happening. Every time a model generates output, the data has to be read out of memory, computed, and written back. The read and the write swallow almost the entire cycle. While that data moves, the GPU does nothing. It sits there, fully powered, fully paid for, waiting. By Kim's estimate the memory is doing only about 30 percent of the work it needs to do. The processor idles the rest. So a million installed GPUs run at 10 to 20 percent. You are not compute constrained. You are memory constrained, and the expensive part is standing around. Adding more GPUs does not fix this. It gives you more processors starving for the same data. Here is the part that decides the next decade. Memory can grow. When a cell cannot shrink any further, you stack it into a high-rise, layer on layer. A GPU cannot be stacked. It runs too hot and needs a cooler bolted to its back, so the one move that rescues memory is closed to the processor. The thing that can keep stacking compounds. The thing that cannot plateaus. The marginal dollar in an AI build now buys more by fixing the memory path than by bolting on another idle GPU. Which is why the companies that control memory bandwidth and supply are not suppliers to the AI trade. They are the AI trade.

Fireside Alpha

38,370 views • 4 days ago
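
The "memory constrained, not compute constrained" argument in the post above is commonly expressed with a roofline model: achievable compute is the smaller of the chip's peak and what memory bandwidth can feed. The sketch below is illustrative only. The peak and bandwidth figures are hypothetical, and it does not attempt to reproduce the 10-20% estimate quoted in the post.

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Roofline model: compute is capped by the peak or by the memory feed."""
    return min(peak_flops, bandwidth_bytes_per_s * flops_per_byte)


def utilization(peak_flops, bandwidth_bytes_per_s, flops_per_byte):
    """Fraction of peak compute that the workload can actually use."""
    return attainable_flops(peak_flops, bandwidth_bytes_per_s,
                            flops_per_byte) / peak_flops


# Hypothetical accelerator: 1 PFLOP/s peak compute, 3 TB/s memory bandwidth.
PEAK_FLOPS = 1.0e15
BANDWIDTH = 3.0e12

# Arithmetic intensity = FLOPs performed per byte moved from memory.
for intensity in (1, 10, 50, 500):
    share = utilization(PEAK_FLOPS, BANDWIDTH, intensity)
    print(f"{intensity:>4} FLOPs/byte -> {share:.1%} of peak")

# Below this intensity the workload is memory bound on this hardware.
ridge_point = PEAK_FLOPS / BANDWIDTH
print(f"ridge point: {ridge_point:.0f} FLOPs/byte")
```

In this model, raising peak compute without raising bandwidth only lowers utilization for low-intensity work, which is the post's point about adding processors.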

Chamath: Two terms you need to pay attention to in AI are Prefill and Decode “There's two terms that I think you're going to hear a ton about over these next few years.” “The first term is prefill, and the next is decode.” “What prefill and decode are, are two very distinct ways of how models think, and how a model goes through the process of answering a question that you ask it.” “And so when you send a prompt to AI, what happens is that the model processes it. This is called the reading phase or prefill.” “It reads your entire prompt all at once. And then it does a bunch of math, calculates all these relationships between all the words, and it stores them in temporary memory.” “The problem is that this is really compute bound. So it requires massive brute force. And Nvidia GPUs crush here.” “And their architecture is designed for massive parallel processing, which makes them really amazing at digesting these long prompts.” “So the problem just gets bigger and bigger, Nvidia just completely dominates.” “But the next phase though, this critical phase, the decode phase, is the writing phase, right?” “So the model starts to generate a response, you ask it a question and its response, one token at a time.” “And then to pick the next token to pick the next word, it has to look back at everything it has said already so that it doesn't hallucinate.” “The problem is that this is incredibly memory bandwidth constrained.” “And in our architecture, a long time ago, we made these design decisions from day one.” “And so what we did was we took a very different architectural approach, we took a very conservative process technology. We weren't pushing the boundaries of physics.” “And we used a lot of what's called SRAM. 
So memory on the chip so that we could do this decode thing as well or better than everybody else.” “And so now when you put these two things together, I just think it's going to create a huge acceleration in the ability for this entire infrastructure layer to get much cheaper and much more valuable, which I suspect then it'll have a lot more developer pull, you'll get a lot more applications being built, billions and billions of more people using it.”

The All-In Podcast

563,785 views • 6 months ago
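
The two phases described in the transcript above can be mimicked with a toy model that does no real math and only counts work. This is a structural sketch under loud assumptions: `ToyModel` and its counters are invented for illustration, the "tokens" are placeholders rather than predictions, and the cache stores tokens instead of the key and value tensors a real model would keep.

```python
from dataclasses import dataclass, field


@dataclass
class ToyModel:
    """Counts forward passes and cache reads; performs no real inference."""
    kv_cache: list = field(default_factory=list)  # one entry per token seen
    forward_passes: int = 0
    cache_reads: int = 0

    def prefill(self, prompt_tokens):
        """Reading phase: the whole prompt is processed in one batched pass."""
        self.forward_passes += 1
        self.kv_cache.extend(prompt_tokens)

    def decode_step(self):
        """Writing phase: one token per pass, looking back at every entry."""
        self.forward_passes += 1
        self.cache_reads += len(self.kv_cache)
        next_token = f"tok{len(self.kv_cache)}"  # placeholder output
        self.kv_cache.append(next_token)
        return next_token


def generate(model, prompt_tokens, max_new_tokens):
    """Run prefill once, then decode one token at a time."""
    model.prefill(prompt_tokens)
    return [model.decode_step() for _ in range(max_new_tokens)]


model = ToyModel()
output = generate(model, ["what", "is", "a", "token"], max_new_tokens=3)
print(output)
print("forward passes:", model.forward_passes)
print("cache entries read during decode:", model.cache_reads)
```

The prompt costs a single pass regardless of its length, while every generated token costs another pass that rereads a growing cache. That asymmetry is why the first phase tends to be limited by compute and the second by memory traffic.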

I asked Garry Tan how to use meta prompting to get better at AI: "My partners at YC Jared Friedman and Pete Koomen showed me how to do this. You can take almost anything that you do all the time and just drop it into a context window. And then say, “Here’s a bunch of inputs and outputs." And maybe you also add a bunch of notes. And then you tell it, “Write me a prompt that can act as an agent that takes this input and makes this output over here.” You can do this for almost any type of knowledge work. And you can even introspect. "What are things you notice that I did to convert this from the input to the output?”. And then you can just start using the prompt. Initially, it’s going to suck. Because it’s just not that smart yet. But what’s funny is now, I also use it to Iterate my writing. You can be very direct, "I would never say that", "Don’t say it like this", or "Oh, you used the long word there, use the short word". Just speak to it conversationally. And then when you're happy with the output, you can use that new output to make a new prompt. "Based on this conversation, give me a better initial prompt that incorporates all the things we talked about." And you can do this with literally everything. And in theory, there’s so much it applies to that people do day-to-day. You could use it for tweets. You could use it for editing podcasts. You can use it for pretty much everything. I have a folder of prompts that I use all the time. My YouTube prompt is on v27 or something. I'll go through this process with all the different max models. I'll use GPT 5.2 Pro. I’ll use Grok. I'll use Claude. Then, I’ll take all the outputs from all the models and put them into Claude and say "Here’s my prompt, here’s the output from four LLMs, including yourself. Rate each response and tell me what the pros and cons of each approach are." And I usually say "give it to me in numbered form". And then you can agree with one, disagree with two, tell it three is this or that. 
And then after that, you say given all of this, synthesize it."

The Peel

51,632 views • 4 months ago
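
The workflow described above, collecting input and output pairs, asking for a prompt, then comparing several models' answers, amounts to assembling text. The sketch below only builds the strings. It calls no model API, the function names are hypothetical, and the wording of the instructions is adapted from the quote rather than taken from any actual prompt file.

```python
def build_meta_prompt(examples, notes=""):
    """Assemble a request asking a model to write a reusable prompt.

    examples: list of (input_text, output_text) pairs from past work.
    notes: optional free-form remarks about how the outputs were produced.
    """
    parts = ["Here's a bunch of inputs and outputs."]
    for index, (input_text, output_text) in enumerate(examples, start=1):
        parts.append(f"Example {index} input:\n{input_text}")
        parts.append(f"Example {index} output:\n{output_text}")
    if notes:
        parts.append(f"Notes:\n{notes}")
    parts.append(
        "Write me a prompt that can act as an agent that takes this kind "
        "of input and makes this kind of output."
    )
    return "\n\n".join(parts)


def build_comparison_prompt(prompt, outputs_by_model):
    """Assemble a request asking one model to rate several models' outputs."""
    parts = [f"Here's my prompt:\n{prompt}"]
    for model_name, output_text in outputs_by_model.items():
        parts.append(f"Output from {model_name}:\n{output_text}")
    parts.append(
        "Rate each response and tell me the pros and cons of each approach. "
        "Give it to me in numbered form."
    )
    return "\n\n".join(parts)


meta = build_meta_prompt(
    [("raw transcript A", "edited summary A"),
     ("raw transcript B", "edited summary B")],
    notes="I prefer short words over long ones.",
)
print(meta)
```

Keeping each version of the resulting prompt in a file makes the iteration loop from the quote ("v27 or something") straightforward to track.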

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete." the browser was blank. the 3D scene never rendered. not once. and the long horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

Sudo su

32,589 views • 4 days ago

Ever since I wired Claude Code to WhatsApp 3 weeks ago, I built a stupidly large infra around it. I mean, opus built it. No clue how the code even looks. The entire thing was vibe coded using my phone. I wanted to see how far I could push it without touching the computer. Everything via WhatsApp. Build what I need on the fly. So the resulting infrastructure will already be battle tested for software development.

The entire thing was streamlined with nearly no manual interventions, everything was communicated via WhatsApp using a single script establishing this connection. If the script is down, I need to get home to start it again to resume the development. Claude was upgrading it, debugging it, restarting it while maintaining constant uptime so it could keep communicating with me. I stressed Claude about it, telling it that it will be “in the dark” and other words that deliberately sound scary about losing communications if the script dies. I also refused git and refused cloning the code, I wanted to see Claude adapting to work on a *LIVING* system.

The way this whole thing works: Claude has its own dedicated phone number that I am paying for. A real WhatsApp account for it is installed on a real iPhone that is sitting on my desk. All is registered under my name, this is a legit setup with no hacks and tricks. I’ve set up a WhatsApp “Community” and multiple different groups under it. Both me and Claude are the admins, so Claude could edit it on my behalf. Each group is a project I am working on and has its own isolated context. The Group description is a system prompt that gets auto-appended to the larger system prompt explaining this setup in general. When I send a message it’s an instant interrupt to Claude Code’s process, just like in the terminal. Voice notes are seamlessly transcribed with a local Whisper model. Images are used with multimodal reading in an isolated parallel session. Multiple groups running in parallel so I can work on all projects at the same time. No cross-talking, everything has an isolated context and history.

And because it’s local on my own machine: Everything is REAL. The browser is REAL. I am connected as myself on it to all services because I actually use it in real life. Claude has unlimited internet access, just like humans who use actual browsers. It utilizes custom-made browser tools that I made to control any browser session it wants. Depending on the situation, it can either connect to my existing session or create one of its own. (You can tell it ‘look at my browser for a sec’ then talk about the current page you are on and it just works, pretty cool) My custom browser tools are not perfect (not by a long shot) but I managed to make them work well to the point they are somewhat reliable. This gives Claude full access to my real creds and all the services I actually use.

I’m productive AS HELL with this. It really feels like a personal assistant. I ask it to read my emails and msgs, check x.com for news, research arxiv papers, write code, run experiments for me, investigate and reverse engineer github repos, even use my credit card and order things. [I try not to do this one a lot lol so far no disasters]. All from my phone. Super convenient.

This is not a product or an open source project (maybe soon if it makes sense). This is just an ugly script I hacked; the entire thing is ~600 lines. (ok maybe i did look at the code, but i swear i didn’t edit!) You can also vibe code this from scratch pretty fast and it will probably even end up better. This is just a cool thing so I’m sharing. It is a real speed booster for many things I do on a daily basis, mostly boring things. Forcing my routine into some new “agent platform” just didn’t feel right for me. WhatsApp is where I already communicate and look for messages, so I decided that my agents will live there too. AGI in my pocket 24/7.

Yam Peleg

419,495 views • 6 months ago
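
The routing idea in the post above, one isolated context per WhatsApp group with the group description appended to a general system prompt, can be sketched in a few lines. This is not the author's ~600-line script, which the post does not show. Every name here is hypothetical, the base prompt text is invented, and the sketch talks to neither WhatsApp nor Claude Code; it only shows how per-group state could be kept apart.

```python
from dataclasses import dataclass, field

# Hypothetical general prompt; the post does not reveal the real one.
BASE_SYSTEM_PROMPT = "You are reached through WhatsApp. Each group is one project."


@dataclass
class GroupSession:
    """State for one group: its description and its own message history."""
    description: str
    history: list = field(default_factory=list)

    def system_prompt(self):
        # The group description is appended to the general system prompt.
        return f"{BASE_SYSTEM_PROMPT}\n\n{self.description}"


class GroupRouter:
    """Keeps one session per group so that contexts never mix."""

    def __init__(self):
        self.sessions = {}

    def register_group(self, group_id, description):
        self.sessions[group_id] = GroupSession(description)

    def route(self, group_id, text):
        """Record an incoming message in its own group's history only."""
        session = self.sessions[group_id]
        session.history.append(text)
        return session


router = GroupRouter()
router.register_group("site", "Project: personal website. Stack: static HTML.")
router.register_group("papers", "Project: summarize new arxiv papers daily.")

session = router.route("site", "add a dark mode toggle")
print(session.system_prompt())
print(router.sessions["papers"].history)
```

In a real setup each session would also own a separate agent process, which is what would let several groups run in parallel as the post describes.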