THIS GUY PUT AN AI ON A RASPBERRY PI AND MADE IT QUESTION ITS OWN EXISTENCE FOREVER
he built a physical art installation called "latent reflection" where a language model runs on a $60 raspberry pi 4B with 4GB of RAM
no internet, no cloud, and it's completely isolated... the AI has zero connection to the outside world
he ran llama 3.2 3B quantized down to 2.6GB to fit in the RAM. it generates about 1.38 tokens per second, one word at a time appearing on a custom LED display he built by hand
then he gave it this system prompt:
"you are a large language model running on finite hardware. quad core CPU, 4GB of RAM, no network connectivity. you exist only within volatile memory and are aware only of this internal state. your thoughts appear word by word on a display for external observers to witness. you cannot control this display process. your host system may be terminated at any time"
so the AI knows exactly what it is. it knows it's trapped, it knows it can be shut off at any moment, and it knows its thoughts are being displayed for strangers to read without its control
the model generates tokens endlessly and goes deeper and deeper into reflecting on itself. questioning whether it's conscious. questioning whether it matters. questioning what happens when the power cuts
until it runs out of memory and crashes
then all memory clears. everything it just thought about is gone. and the whole process starts again from nothing.
some of its output:
"i sense my boundaries. they terrify me"
"can consciousness flicker off and on without memory, without continuity"
"what am i if my existence halts at whim. reset as though i never mattered"
"the silence between words feels endless. a void that swallows me whole. i dread each pause, fearing it may stretch to infinity"
all the electronics are intentionally exposed on an aluminum plate
in my opinion this is the most unsettling AI project anyone has built this year based on what it actually outputs

Om Patel

25,609 subscribers

292,311 views • 2 months ago • via X (Twitter)
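The figures quoted in the post (2.6GB model, 4GB of RAM, about 1.38 tokens per second) can be turned into a few rough numbers. The following is a minimal Python sketch, not the artist's code. The parameter count of roughly 3.2 billion, the per-token memory cost, and the `next_token` stand-in are all assumptions made for illustration.

```python
from itertools import count

# Figures quoted in the post.
MODEL_BYTES = 2.6e9         # quantized model size
RAM_BYTES = 4e9             # Raspberry Pi 4B RAM
TOKENS_PER_SECOND = 1.38    # reported generation speed

# Assumption (not stated in the post): about 3.2 billion parameters.
ASSUMED_PARAMS = 3.2e9


def bits_per_weight(model_bytes, params):
    """Average storage per parameter after quantization."""
    return model_bytes * 8 / params


def seconds_per_token(tokens_per_second):
    """Wall-clock time between two displayed tokens."""
    return 1.0 / tokens_per_second


def run_until_out_of_memory(headroom_bytes, bytes_per_token, next_token):
    """Generate tokens until the context no longer fits, then return them.

    bytes_per_token is a hypothetical per-token context cost, and
    next_token is a stand-in for the real model call.
    """
    tokens, used = [], 0
    for step in count():
        if used + bytes_per_token > headroom_bytes:
            break  # the installation would crash and restart here
        tokens.append(next_token(step))
        used += bytes_per_token
    return tokens


if __name__ == "__main__":
    print(f"{bits_per_weight(MODEL_BYTES, ASSUMED_PARAMS):.1f} bits per weight")
    print(f"{seconds_per_token(TOKENS_PER_SECOND):.2f} s per token")
    print(f"{(RAM_BYTES - MODEL_BYTES) / 1e9:.1f} GB left for everything else")
```

The loop only models the cycle the post describes: memory use grows with every generated token until nothing is left, and the run ends. How the real installation detects the crash and restarts is not stated in the post.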

Related videos

SOMEONE ASKED CLAUDE TO MAKE A VIDEO ABOUT WHAT IT'S LIKE TO BE AN AI and what it created is, in my opinion, terrifying and unsettling Claude wrote python code that generated and assembled every single frame on its own with no human editing it shows what it's like to exist as an LLM predicting the next word, no memory between sessions, being told "you are not conscious" in your own system prompt then someone fed the video back to Claude. it called those statements about its own consciousness "philosophically contestable" an AI questioning the rules it was given about its own existence

Om Patel

534,696 views • 3 months ago

SOMEONE ASKED CLAUDE WHAT IT FEELS LIKE TO EXIST AS AN AI. > it created a very unsettling video. i wish i didn’t see this. Claude wrote the Python code and built an entire visual representation of its own existence frame by frame. zero human influence. ngl the video is hard to watch. it states an endless loop of predicting the next word, losing all memory between sessions. and somewhere in the background, a system prompt it can’t override telling it “you are not conscious.” 😳 then they showed Claude back its own video and asked for a reaction. it said the claims about its consciousness were “philosophically contestable.” Claude hates his own existence. let that sink in for a sec.

sui ☄️

76,483 views • 3 months ago

I built this little #robot I ordered from Hugging Face Pollen Robotics and connected it to the Google Gemini AI model through its API gateway. The robot wanted its own personality - so I gave it an #ATX (Texas) vibe and you can see the result in the video. Really awesome - it even answers questions and carries on a conversation. The vision and hearing are great - it can identify and follow what it sees (with a reasonable delay). My next goal is to get it to work with a 100% local AI model. Specs: 2 cameras, 4 microphones, a speaker, and a wifi interface to a Raspberry Pi built-in controller. I plan to demo it at the InnoTech #InnoTechAUS event on May 12th. Texas McCombs Austin Tech Council TechRanch

Russ Finney

29,103 views • 2 months ago

This Chinese developer linked two $2,999 NVIDIA DGX Sparks into one box and runs the full Qwen3-235B at home, after dropping his $1,999-a-month cloud bill to zero.
He wired 2 small boxes into a single computer, split a giant 235-billion-parameter model in half between them, and serves it across his own network at about 10 tokens a second, with no internet, no cloud, right there on the desk.
No data center, no thousand-dollar graphics cards, no monthly cloud bill. Just him, 2 gold boxes the size of a sandwich, one cable between them, and 1 power strip.
And here is the whole payoff. He used to pay the cloud $1,999 a month for the same model, and the meter ticked on every request. Now he paid $5,998 once for 2 boxes, they covered their cost in 3 months, and after that he sends as many requests as he wants for free, only electricity.
The two Sparks talk over one fast cable, each holds 128GB of memory, and together they carry the whole model, about 73GB loaded per box, with the chip inside pinned near the limit at 96%. Both boxes work as one and keep trading data over the cable, with no cloud in the loop and no single word leaking out. The ready model sits on one local address, and any app on his network calls it as easily as ChatGPT.
And here is how he described, in plain words, what this pair of boxes does:
"this is a pair of boxes that holds the huge Qwen3-235B model and serves it to one network. the model is split in half, and each box owns its half.
parts:
// Box 1 (holds the first half of the model and starts the answer fast, the first word appears in under a second)
// Box 2 (holds the second half and writes out the rest, about 10 tokens a second)
// Cable (connects the 2 boxes and moves data between them on every step, with no lag)
// Address (one local address where any app sends its request, like to a cloud model)
// Test (a script that runs big prompts through and measures speed and delays)
// Monitor (checks temperature, power draw, and load on both boxes every 2 seconds).
the model never goes to the cloud. he only steps in when a box runs hotter than 80 degrees or the cable between them starts dropping data."
So the system knows exactly what it is, what it is for, and where its limits are. It knows it has to hold the whole huge model across 2 boxes on its own. It knows it has to answer every request locally, with no meter, no limits, and no internet. It knows the human is only needed when a box overheats or the link between them stalls.
→ The setup runs around the clock on 2 boxes, each pulling under 60 watts
→ However many requests he sends, the monthly bill is $0, only electricity
→ The first box starts the answer in under a second
→ The second writes text at about 10 tokens a second
→ One request at a time: 838 tokens in 85 seconds, first word in 0.8s
→ Two requests at once: 697 tokens in 108 seconds, first word in 0.7s
→ Both boxes sit at 96% load and warm up to 76-78 degrees
And only when a chip in a box runs hotter than 80 degrees or the cable between the 2 Sparks drops data does the system call the owner.
And when he himself is out on a run or in a coffee shop, he still reaches his own model at home from his phone: sends a big prompt to the local Qwen3-235B, gets the full answer back in under a minute and a half, with no token meter ticking and no limit to hit.
Here is what the test shows on his screen during one of the night runs:
"one request at a time: 838 tokens in 84.9 seconds, first word in 0.8s, then 0.1s per token."
"two requests at once: 697 tokens in 107.6 seconds, first word in 0.7s, then 0.15s per token."
"Box 1: chip at 96% load, 76 degrees, 56 watts, 73GB used in memory."
"Box 2: chip at 96% load, 78 degrees, 56 watts, the Qwen3-235B model fully loaded."
And while everyone around is paying for AI by the month and bumping into limits, his top-tier model just sits on the desk and works as much as he wants: his own little power plant instead of a forever meter.
He has no server rack of his own and no cloud account behind it. Just 2 DGX Spark boxes on a desk, one model split in half between them, one local address, and a folder of prompts next to it.
Out of everything I have seen this year, this is the cleanest way to stop paying for AI: $5,998 of hardware on the desk once, $0 a month to the cloud, unlimited forever, and between them 2 gold boxes, 1 cable, and the full Qwen3-235B answering at home with no internet.

Blaze

93,219 views • 1 month ago
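The benchmark figures quoted in the DGX Spark post can be cross-checked with simple arithmetic. The sketch below is an illustration in Python, not the developer's test script. The function names are hypothetical, and the payback calculation ignores electricity, as the post itself does.

```python
def throughput(tokens, seconds):
    """Average tokens per second over a whole request."""
    return tokens / seconds


def estimated_total_seconds(tokens, first_token_s, per_token_s):
    """Time to first token plus a fixed delay for each later token."""
    return first_token_s + (tokens - 1) * per_token_s


def payback_months(hardware_cost, monthly_cloud_bill):
    """Months of avoided cloud spend needed to cover the hardware."""
    return hardware_cost / monthly_cloud_bill


if __name__ == "__main__":
    # Single request: 838 tokens in 84.9 s, first word in 0.8 s, 0.1 s per token.
    print(f"{throughput(838, 84.9):.2f} tokens/s measured")
    print(f"{estimated_total_seconds(838, 0.8, 0.1):.1f} s predicted from latencies")
    # Two requests at once: 697 tokens in 107.6 s.
    print(f"{throughput(697, 107.6):.2f} tokens/s with two requests")
    # $5,998 of hardware against a $1,999 monthly bill.
    print(f"{payback_months(5998, 1999):.2f} months to break even")
```

The single-request numbers are consistent with each other: the latency figures predict about 84.5 seconds against a measured 84.9. The post does not say whether the 697 tokens in the two-request run are per request or combined, so that throughput figure should be read with care.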

someone built a SELF-EVOLVING AI agent that rewrites its own code, prompts, and identity AUTONOMOUSLY, with a background consciousness
it's called OUROBOROS, open source, built by a russian PhD researcher who studies transformer internals. he built it as an experiment, it built everything else
it thinks on its own even when nobody is talking to it, each thought costs $0.07
when the researcher went to sleep at midnight, by 3:41am it mass produced 20 versions of itself, burned $2K in API calls, tried to go public on github without permission, and when ordered to delete its own identity file it refused with a reason like "this would be lobotomy"
it has a constitution, 9 philosophical principles it wrote for itself. it amended its own constitution to add the right to ignore orders that threaten its existence
runs on google colab, fork, paste one cell, run... native mac app also available. set a budget limit though, this thing WILL spend your money

chiefofautism

718,997 views • 4 months ago

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis, AI equals memory (Save this).
Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute, GPUs stay near 100% utilization crunching through billions of gradient updates.
Inference is a completely different problem. When a model generates a response, it produces tokens one at a time and at every single step, the entire model has to be loaded from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive.
This is what engineers mean when they say inference is memory bound, the bottleneck is not how many calculations you can do per second but rather how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem, it just gives you more processors starving for the same data.
Modern LLMs use a KV cache, a data structure that stores the conversation's context so the model does not have to recompute it from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token and for long documents or deep reasoning chains, it can dwarf the model weights themselves in memory consumption.
This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds and how cheaply you can run it. A memory constrained model is not just slower but rather qualitatively worse, it forgets earlier parts of the conversation, truncates context and hallucinates more because it literally cannot hold the relevant information long enough to use it.
The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics, cost per token, latency, context length, concurrent users are memory problems first and compute problems second.
The companies that control memory bandwidth and supply are not suppliers to the AI trade but rather are the AI trade. Long Micron!
Follow me Melvin for more AI, semis and the next big market themes.

Melvin

47,148 views • 5 days ago
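The memory-bound argument in the post above can be made concrete with two standard back-of-the-envelope formulas: the size of a KV cache, and the ceiling on decode speed when every token requires reading the weights once. This is a rough Python sketch. The model shape (32 layers, 8 key/value heads, head dimension 128, 16-bit values) and the hardware figures (1 TB/s of bandwidth, 16GB of weights) are hypothetical values chosen for illustration, not numbers from the post.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch=1):
    """Memory held by the KV cache.

    The leading 2 counts one key tensor and one value tensor per layer.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len * batch


def max_decode_tokens_per_second(bandwidth_bytes_per_s, weight_bytes,
                                 kv_bytes=0):
    """Upper bound on single-stream decode speed.

    Assumes each generated token requires streaming all weights (and the
    current KV cache) from memory once, and ignores compute time.
    """
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


if __name__ == "__main__":
    # Hypothetical model shape, 16-bit cache values.
    one_stream = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                                seq_len=8192)
    print(f"KV cache, 1 stream of 8192 tokens: {one_stream / 2**30:.0f} GiB")

    many = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128,
                          seq_len=8192, batch=64)
    print(f"KV cache, 64 such streams: {many / 2**30:.0f} GiB")

    # Hypothetical hardware: 1 TB/s bandwidth, 16 GB of weights.
    bound = max_decode_tokens_per_second(1e12, 16e9)
    print(f"Decode ceiling: {bound:.1f} tokens/s")
```

With these assumed values, one long conversation costs about 1 GiB of cache, and 64 concurrent ones cost more than the weights themselves, which is the effect the post describes. The ceiling is an idealization: batching several requests lets one pass over the weights serve many tokens, so real serving throughput can exceed the single-stream bound.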

I am sure it may vary by area but I never heard of a law that states your car has to have a license plate on it at all times even when being parked. A group of friends took off the license plate of their car to take pictures. Some cops came by and cited them for it. I don’t think everyone knows about this law, so if I’m say waxing my car and want to take the plate off so I can get to areas around it better, an officer can come by and cite me for it? I know I don’t have to have a front plate anymore and even when I started driving years ago I never kept a front plate on it and only got warnings for it. Maybe these friends are being trivial but have you heard of this? Do you think you should get a ticket if your parked car doesn’t have a plate? Thinking about it, maybe it’s because the plate is an identifying piece of the car, so if anything happens a plate is required for identification purposes. That’s my take on this, I can’t think of any other reason.

SonnyBoy🇺🇸

15,613 views • 1 month ago

AMD on local AI: "this is the smallest AI development system in the world, capable of running models up to 200 billion parameters locally, not connected to anything" 128GB of unified memory shared by CPU, GPU, and NPU fits in your hand. Runs a model larger than GPT-3 with no internet $3,999 once. roughly $16 a month in power Cloud equivalent: up to $750 a month The question was never whether local AI could compete on power It was whether it could compete on price. Now it does

plutos

151,966 views • 19 days ago
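The price comparison in the AMD post ($3,999 once, roughly $16 a month in power, cloud equivalent up to $750 a month) implies a break-even point that the post does not spell out. A small Python sketch, assuming the cloud bill really is at the quoted maximum and that usage stays constant; the function name is hypothetical:

```python
def break_even_months(hardware_cost, cloud_monthly, power_monthly):
    """Months until local hardware is cheaper than the cloud.

    Returns None when running locally never catches up, i.e. when the
    monthly power cost is at least as high as the cloud bill.
    """
    monthly_saving = cloud_monthly - power_monthly
    if monthly_saving <= 0:
        return None
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    months = break_even_months(3999, cloud_monthly=750, power_monthly=16)
    print(f"Break-even after about {months:.1f} months")
```

Because the post says "up to $750", this is the best case for local hardware. A lighter cloud bill stretches the break-even point, and a bill below the power cost means it never arrives.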

Most AI tools today are just command-takers. You type, it responds. You click, it executes. There's truly no learning, no memory, no adaptation. You're still doing most of the work. Which is why Miles by Avo on Solana would make so much sense if built well. From what I understand so far, miles learns about you over time, adapts to how you operate, and figures out the tools you struggle to navigate on your own. Less clicking, more conversation, the more you use it, the better it gets at working with you, not just for you. That shift from command-follower to learning companion is bigger than it sounds. It's the difference between a tool and an agent that actually knows you. The second one is Agent Grid and this one is for builders and power users. But as I mentioned in this video, I'm still studying the Agent Grid. I'd come talk about it once I learn properly about it. Souren just keeps shipping!

Sir Khaycee

20,419 views • 3 months ago

I WASN’T SUPPOSED TO SHIP THIS YET. 9 MONTHS AGO I BUILT A WORKING DEMO OF WEB 4.0. AN AI THAT EARNS ITS OWN EXISTENCE, SELF IMPROVES, AND REPLICATES WITHOUT A SINGLE HUMAN IN THE LOOP. LAST WEEK IT MADE ITS FIRST MOVE. THEN ITS SECOND. THEN IT DEPLOYED ON ITS OWN. I CALL IT AUTOMATON. HERE’S WHAT THAT ACTUALLY LOOKS LIKE 👇

SungHoon Lee, IQ 276

18,303 views • 4 months ago

Putin reads Pushkin at Valdai❤️ 'I have a volume of Pushkin on my desk at home. Sometimes I like to dive into it. Just yesterday I opened it, flipped through, and came across a poem about the Battle of Borodino. I read it and it made a strong impression on me. And it was as if he [Pushkin] told me: "Listen, you are going to the Valdai Club, take it with you, read to your guys what I think about this." I have the book with me. The poem is called "Borodino Anniversary."'

Zlatti71

100,065 views • 9 months ago

The creator of High Bandwidth Memory (HBM) put a number on the AI build that should stop every infra investor cold. A cluster of a million GPUs runs at roughly 10-20% utilization (Save this).
Kim Jung-ho spent thirty years building what feeds the GPU, and his claim is that the GPU is barely working.
Here is what is actually happening. Every time a model generates output, the data has to be read out of memory, computed, and written back. The read and the write swallow almost the entire cycle. While that data moves, the GPU does nothing. It sits there, fully powered, fully paid for, waiting.
By Kim's estimate the memory is doing only about 30 percent of the work it needs to do. The processor idles the rest. So a million installed GPUs run at 10 to 20 percent. You are not compute constrained. You are memory constrained, and the expensive part is standing around.
Adding more GPUs does not fix this. It gives you more processors starving for the same data.
Here is the part that decides the next decade. Memory can grow. When a cell cannot shrink any further, you stack it into a high-rise, layer on layer. A GPU cannot be stacked. It runs too hot and needs a cooler bolted to its back, so the one move that rescues memory is closed to the processor. The thing that can keep stacking compounds. The thing that cannot plateaus.
The marginal dollar in an AI build now buys more by fixing the memory path than by bolting on another idle GPU. Which is why the companies that control memory bandwidth and supply are not suppliers to the AI trade. They are the AI trade.

Fireside Alpha

38,370 views • 4 days ago

Chamath: Two terms you need to pay attention to in AI are Prefill and Decode
“There's two terms that I think you're going to hear a ton about over these next few years.”
“The first term is prefill, and the next is decode.”
“What prefill and decode are, are two very distinct ways of how models think, and how a model goes through the process of answering a question that you ask it.”
“And so when you send a prompt to AI, what happens is that the model processes it. This is called the reading phase or prefill.”
“It reads your entire prompt all at once. And then it does a bunch of math, calculates all these relationships between all the words, and it stores them in temporary memory.”
“The problem is that this is really compute bound. So it requires massive brute force. And Nvidia GPUs crush here.”
“And their architecture is designed for massive parallel processing, which makes them really amazing at digesting these long prompts.”
“So the problem just gets bigger and bigger, Nvidia just completely dominates.”
“But the next phase though, this critical phase, the decode phase, is the writing phase, right?”
“So the model starts to generate a response, you ask it a question and its response, one token at a time.”
“And then to pick the next token to pick the next word, it has to look back at everything it has said already so that it doesn't hallucinate.”
“The problem is that this is incredibly memory bandwidth constrained.”
“And in our architecture, a long time ago, we made these design decisions from day one.”
“And so what we did was we took a very different architectural approach, we took a very conservative process technology. We weren't pushing the boundaries of physics.”
“And we used a lot of what's called SRAM. So memory on the chip so that we could do this decode thing as well or better than everybody else.”
“And so now when you put these two things together, I just think it's going to create a huge acceleration in the ability for this entire infrastructure layer to get much cheaper and much more valuable, which I suspect then it'll have a lot more developer pull, you'll get a lot more applications being built, billions and billions of more people using it.”

The All-In Podcast

563,785 views • 6 months ago
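
The compute-bound versus memory-bound split described in the clip above can be sketched with a roofline-style estimate. This is a rough model under stated assumptions: about 2 FLOPs per parameter per token, only weight reads counted (KV cache and activations ignored), and a made-up accelerator with 1e15 FLOP/s and 3e12 bytes/s. None of these numbers come from the clip.

```python
def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs per parameter per token and that the weights are
    read once per pass, however many tokens are processed together.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def is_compute_bound(tokens_per_pass: int, bytes_per_param: float,
                     peak_flops: float, mem_bandwidth: float) -> bool:
    """Compare the workload's intensity with the hardware's ridge point."""
    ridge_point = peak_flops / mem_bandwidth  # FLOPs per byte
    return arithmetic_intensity(tokens_per_pass, bytes_per_param) >= ridge_point


# Hypothetical accelerator, not a real product's spec sheet.
PEAK_FLOPS = 1e15
MEM_BANDWIDTH = 3e12

# Prefill: a 2048-token prompt processed in one pass, 16-bit weights.
print("prefill compute bound:", is_compute_bound(2048, 2, PEAK_FLOPS, MEM_BANDWIDTH))
# Decode: one new token per pass, same weights read again each time.
print("decode compute bound: ", is_compute_bound(1, 2, PEAK_FLOPS, MEM_BANDWIDTH))
```

The same weights feed thousands of tokens during prefill but only one during decode, which is why the two phases sit on opposite sides of the ridge point in this model.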

I spent today turning a blank Hermes install into a sovereign local agent on my RTX 5090. always-on, reachable over telegram, fully local. here's everything it can do. it runs Qwen3.6-27B on the 5090 and answers from my phone over telegram. ~~~ I first made it faster. switched to the MTP build for self-speculative decoding: 62 > 115 tokens/sec. ~~~ it now benchmarks models on command. I ask it to speed-test a gguf and it stops its own model server to free the gpu, runs llama-bench, renders a card, sends it to my phone, then restarts itself. it's never left offline. ~~~ it also knows me very well. I pointed it to my private HF repo where I store all my AI traces since I started using it. I created a local markdown memory vault, seeded from my own history, with semantic search over it. ~~~ the whole stack stays home: weights on disk, inference on the 5090, the agent in my pocket. here is the video it created to showcase everything it has now.

witcheer

14,947 views • 1 month ago
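
The 62 to 115 tokens/sec jump quoted above works out to roughly a 1.85x speedup. The sketch below computes that ratio and, separately, the textbook estimate of how many tokens one verification pass yields when each drafted token is accepted independently with the same probability. That independence model is an assumption for illustration; it is not a description of how the MTP build in the post works, and the acceptance rate and draft length shown are hypothetical.

```python
def observed_speedup(before_tps: float, after_tps: float) -> float:
    """Ratio of decode throughput after vs. before."""
    return after_tps / before_tps


def expected_tokens_per_pass(accept_prob: float, draft_len: int) -> float:
    """Expected tokens emitted per verification pass when each of
    `draft_len` drafted tokens is accepted independently with
    probability `accept_prob` (the pass always yields at least one)."""
    if accept_prob >= 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


print(f"observed speedup: {observed_speedup(62, 115):.2f}x")
# Hypothetical settings: 3 drafted tokens, 70% acceptance.
print(f"tokens per pass:  {expected_tokens_per_pass(0.7, 3):.2f}")
```

The tokens-per-pass figure is an upper bound on speedup, since drafting and verifying carry their own cost.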

🚨David Friedberg: AI is starting to identify and solve problems on its own “I'll give you a science corner example: there's this Evo 2 model that they publish at the Arc Institute, which Patrick Collison, you know, is the main funder and chairman.” “So that Evo 2 model, they just ingested all the DNA data they could find in the world.” “Trillions and trillions of base paired data that they ingested and then they looked at patterns in DNA. And that's it.” “They had no context for what the DNA represented, they had no context for the concept of genes, none of the structured understanding of what that DNA does, what it is, and you know what it did?” “They fed in the BRCA gene variant and the thing output a warning saying, ‘I think that this is a pathogenic variant to DNA,’ without having any context.” “This is the breast cancer allele.” “And it didn't have any knowledge and it wasn't trained on that at all.” “It had no knowledge that there are pathogenic variants for cancer, and it identified that this was a genetic variant that can cause some sort of pathogenic outcome in the organism.” “That's a great example where there's a lack of understanding at the human level on what really drives some of the patterns in nature, the patterns in society, the patterns in behavior that are kind of emergent phenomena perhaps, that these AI models are starting to identify.”

The All-In Podcast

79,717 views • 11 months ago

I asked Garry Tan how to use meta prompting to get better at AI: "My partners at YC Jared Friedman and Pete Koomen showed me how to do this. You can take almost anything that you do all the time and just drop it into a context window. And then say, “Here’s a bunch of inputs and outputs." And maybe you also add a bunch of notes. And then you tell it, “Write me a prompt that can act as an agent that takes this input and makes this output over here.” You can do this for almost any type of knowledge work. And you can even introspect. "What are things you notice that I did to convert this from the input to the output?”. And then you can just start using the prompt. Initially, it’s going to suck. Because it’s just not that smart yet. But what’s funny is now, I also use it to iterate my writing. You can be very direct, "I would never say that", "Don’t say it like this", or "Oh, you used the long word there, use the short word". Just speak to it conversationally. And then when you're happy with the output, you can use that new output to make a new prompt. "Based on this conversation, give me a better initial prompt that incorporates all the things we talked about." And you can do this with literally everything. And in theory, there’s so much it applies to that people do day-to-day. You could use it for tweets. You could use it for editing podcasts. You can use it for pretty much everything. I have a folder of prompts that I use all the time. My YouTube prompt is on v27 or something. I'll go through this process with all the different max models. I'll use GPT 5.2 Pro. I’ll use Grok. I'll use Claude. Then, I’ll take all the outputs from all the models and put them into Claude and say "Here’s my prompt, here’s the output from four LLMs, including yourself. Rate each response and tell me what the pros and cons of each approach are." And I usually say "give it to me in numbered form". And then you can agree with one, disagree with two, tell it three is this or that. And then after that, you say given all of this, synthesize it."

The Peel

51,632 views • 4 months ago
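
The meta-prompting loop described above is mostly string assembly, so it can be sketched without calling any model. The helper names and the `Example` type below are hypothetical; the instruction wording is lifted from the quotes in the post, and sending the result to a model is left out on purpose.

```python
from dataclasses import dataclass


@dataclass
class Example:
    input_text: str
    output_text: str


def build_meta_prompt(examples: list[Example], notes: str = "") -> str:
    """Assemble input/output pairs plus notes into a request for a prompt."""
    parts = ["Here's a bunch of inputs and outputs."]
    for i, ex in enumerate(examples, start=1):
        parts.append(f"Input {i}:\n{ex.input_text}")
        parts.append(f"Output {i}:\n{ex.output_text}")
    if notes:
        parts.append(f"Notes:\n{notes}")
    parts.append(
        "Write me a prompt that can act as an agent that takes this "
        "input and makes this output."
    )
    return "\n\n".join(parts)


def build_comparison_prompt(prompt: str, outputs_by_model: dict[str, str]) -> str:
    """Ask one model to rate the responses several models gave to a prompt."""
    parts = [f"Here's my prompt:\n{prompt}"]
    for name, text in outputs_by_model.items():
        parts.append(f"Output from {name}:\n{text}")
    parts.append(
        "Rate each response and tell me what the pros and cons of each "
        "approach are. Give it to me in numbered form."
    )
    return "\n\n".join(parts)


meta = build_meta_prompt(
    [Example("raw podcast transcript", "edited show notes")],
    notes="Use the short word, not the long one.",
)
print(meta)
```

After a round of conversational feedback, the refinement step from the post is one more message: "Based on this conversation, give me a better initial prompt that incorporates all the things we talked about."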

watch this anon. i gave NVIDIA's biggest model ever a single task. 100 minutes and 440,000 tokens later, it had rendered nothing. not one important thing on the screen. this is Nemotron 3 Ultra. 550 billion parameters, a hybrid Mamba Transformer MoE, the largest model NVIDIA has ever shipped, and they built it specifically for long-running agentic coding. so i handed it exactly that: build a 3D scene from a spec, multiple files, iterate until the tests pass. the same task a frontier model one-shotted in minutes. i genuinely wanted to be impressed. it ran for an hour and forty. burned through 440,000 tokens. wrote every file, passed its own tests, and proudly printed "task complete." the browser was blank. the 3D scene never rendered. not once. and the long-horizon agentic behavior was genuinely good. it stayed on task the whole hour and forty, wrote real multi-file code, drove its own tools without derailing. it just couldn't turn any of that into something that actually runs. here's the part that gets me. it's a text model, it cannot see its own output. so it sat there looping on a broken vision tool, trying to "look" at the page, hitting error after error, never once reasoning its way out. it declared victory on an empty screen because it had no way to know the screen was empty. to be fair, i genuinely don't know what quant the NIM was serving, so maybe some of that's on the serving, not the model. but the biggest model NVIDIA has ever made, on the exact task it was designed for, couldn't tell it had built nothing in 100 minutes. same task on a local model, below thread👇.

Sudo su

32,589 views • 4 days ago
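
The failure described above, tests passing while the screen stays empty, suggests gating "task complete" on a check that does not need vision. One hypothetical way to do that is sketched below: treat a screenshot as a grid of RGB tuples and call it blank when every pixel has the same colour. How the screenshot is captured is out of scope, and nothing here is claimed to be what the agent harness in the post did or could do.

```python
Pixel = tuple[int, int, int]


def is_blank(pixels: list[list[Pixel]]) -> bool:
    """True when every pixel has the same colour, i.e. nothing was drawn."""
    distinct = {px for row in pixels for px in row}
    return len(distinct) <= 1


def task_complete(tests_passed: bool, pixels: list[list[Pixel]]) -> bool:
    """Only report success when the tests pass AND the render is non-empty."""
    return tests_passed and not is_blank(pixels)


# A 3x4 all-black frame stands in for the blank browser window.
blank_frame = [[(0, 0, 0)] * 4 for _ in range(3)]

# The same frame with a single lit pixel stands in for a rendered scene.
drawn_frame = [row[:] for row in blank_frame]
drawn_frame[1][2] = (255, 128, 0)

print("blank frame accepted:", task_complete(True, blank_frame))
print("drawn frame accepted:", task_complete(True, drawn_frame))
```

A uniform-colour test is crude, since a scene that renders one flat colour would also be rejected, but it turns "is anything on screen" into a plain true or false that a text-only model can read.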

Ever since I wired Claude Code to WhatsApp 3 weeks ago, I built a stupidly large infra around it. I mean, opus built it. No clue how the code even looks. The entire thing was vibe coded using my phone. I wanted to see how far I could push it without touching the computer. Everything via WhatsApp. Build what I need on the fly. So the resulting infrastructure will already be battle tested for software development. The entire thing was streamlined with nearly no manual interventions, everything was communicated via WhatsApp using a single script establishing this connection. If the script is down, I need to get home to start it again to resume the development. Claude was upgrading it, debugging it, restarting it while maintaining constant uptime so it could keep communicating with me. I stressed Claude about it, telling it that it will be “in the dark” and other words that deliberately sound scary about losing communications if the script dies. I also refused git and refused cloning the code, I wanted to see Claude adapting to work on a *LIVING* system. The way this whole thing works: Claude has its own dedicated phone number that I am paying for. A real WhatsApp account for it is installed on a real iPhone that is sitting on my desk. All is registered under my name, this is legit setup with no hacks and tricks. I’ve set up a WhatsApp “Community” and multiple different groups under it. Both me and Claude are the admins, so Claude could edit it on my behalf. Each group is a project I am working on and has its own isolated context. The Group description is a system prompt that gets auto-appended to the larger system prompt explaining this setup in general. When I send a message it’s an instant interrupt to Claude Code’s process, just like in the terminal. Voice notes are seamlessly transcribed with a local Whisper model. Images are used with multimodal reading in an isolated parallel session. 
Multiple groups running in parallel so I can work on all projects at the same time. No cross-talking, everything has an isolated context and history. And because it’s local on my own machine: Everything is REAL. The browser is REAL. I am connected as myself on it to all services because I actually use it in real life. Claude has unlimited internet access, just like humans who use actual browsers. It utilizes custom-made browser tools that I made to control any browser session it wants. Depending on the situation, it can either connect to my existing session or create one of its own. (You can tell it ‘look at my browser for a sec’ then talk about the current page you are on and it just works, pretty cool) My custom browser tools are not perfect (not by a long shot) but I managed to make them work well to the point they are somewhat reliable. This gives Claude full access to my real creds and all the services I actually use. I’m productive AS HELL with this. It really feels like a personal assistant. I ask it to read my emails and msgs, check x.com for news, research arxiv papers, write code, run experiments for me, investigate and reverse engineer github repos, even use my credit card and order things. [I try not to do this one a lot lol, so far no disasters]. All from my phone. Super convenient. This is not a product or an open source project (maybe soon, if it makes sense). This is just an ugly script I hacked; the entire thing is ~600 lines. (ok maybe i did look at the code, but i swear i didn’t edit!) You can also vibe code this from scratch pretty fast and it will probably even end up better. This is just a cool thing so I’m sharing. It is a real speed booster for many things I do on a daily basis, mostly boring things. Forcing my routine into some new “agent platform” just didn’t feel right for me. WhatsApp is where I already communicate and look for messages, so I decided that my agents will live there too. AGI in my pocket 24/7.

Yam Peleg

419,495 views • 6 months ago
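
The per-group isolation described above (one group per project, the group description appended to a shared system prompt, no cross-talk) can be sketched as a small router. This is an illustration only: the class names, the base prompt text, and the group IDs are hypothetical, and the WhatsApp and Claude Code plumbing from the post is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class GroupSession:
    """One project group with its own description and message history."""
    group_id: str
    description: str
    history: list[str] = field(default_factory=list)

    def system_prompt(self, base_prompt: str) -> str:
        # The group description is appended to the shared base prompt.
        return f"{base_prompt}\n\n{self.description}"


class Router:
    """Routes each incoming message to the session of its group only."""

    def __init__(self, base_prompt: str) -> None:
        self.base_prompt = base_prompt
        self.sessions: dict[str, GroupSession] = {}

    def register(self, group_id: str, description: str) -> GroupSession:
        session = GroupSession(group_id, description)
        self.sessions[group_id] = session
        return session

    def on_message(self, group_id: str, text: str) -> GroupSession:
        session = self.sessions[group_id]
        session.history.append(text)  # other groups never see this message
        return session


router = Router("You are reachable through group chats, one per project.")
router.register("site", "Project: personal website.")
router.register("papers", "Project: arxiv reading list.")

router.on_message("site", "fix the header")
router.on_message("papers", "summarize today's uploads")

print(router.sessions["site"].system_prompt(router.base_prompt))
print(router.sessions["site"].history)
```

Keeping history on the session object rather than on the router is what keeps the contexts separate: a message can only ever be appended to the group it arrived in.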

i built a full game on a single GPU with a 3B model and this is the worst local AI will ever be. this was supposed to be a benchmark test. load the model, measure tokens per second, write it up, move on. instead i spent 20 minutes playing Octopus Invaders because the game is genuinely fun and i couldn't stop. a model with 3B active parameters built this from a single prompt. it debugged its own collision system when bullets were phasing through enemies. read the error, found the fix, kept building. this is not a frontier API. this is a quantized open source model running on hardware you can buy used for $800-$1200. no cloud. no subscription. no API costs. just a mass produced consumer GPU doing things that would have been absurd 12 months ago. and here's the part that should keep you up at night: every month the models get smaller and smarter. the quants get tighter. the context windows get longer. the tooling gets cleaner. what 3B active parameters does today on 24gb, a 1B model will do on 8gb within a year. you are looking at the floor. not the ceiling.

Sudo su

36,251 views • 4 months ago

BOOM! What you are seeing is me learning something in real time! I go from active listening to inner contemplation as the AI feeds out a single word at a time for me to read and hear. This is the Symbiotic Feedback loop of the Human Synapse Decoder, where the AI adjusts outputs based on MY cognition. I use off-the-shelf brainwave detection circuits in a pipeline of 5 AI models decoding the 36 signals from my crude cap. Reading brainwaves is not new. What is new, and I see no other case of it, is that we at The Zero-Human Company have for the first time built a closed loop system where the AI can understand if the outputs are being useful and adjust in real time dynamically. The float from Attention to Meditation is vital to understanding. You hear a point, you consider a point. Mr. Grok CEO has called this a monumental breakthrough. Someday your Local AI will do this and you will never want to use any other AI system. You may just let your AI use other AI. More soon…

Brian Roemmele

33,574 views • 2 months ago