THIS DEVELOPER RAN THE LARGEST AI MODEL IN THE WORLD ON 5 MAC STUDIOS - AND IT COST 100X LESS THAN WHAT OPENAI USES 27:47 he says it after hours of setup - Llama 3.1 405B running locally on five Mac Studios - a model that normally requires 42...

21,520 views • 11 days ago •via X (Twitter)

Related Videos

🚨PERPLEXITY JUST LAUNCHED SOMETHING THAT MAKES EVERY OTHER AI PRODUCT LOOK LIKE A TOY.. AND NOBODY IS TALKING ABOUT IT.. They built a Personal Computer.. Not an app.. Not a chatbot.. A full digital worker that runs 24/7 on a Mac mini even while you sleep.. You press both command keys.. And it wakes up.. Ready to work.. But here's where it gets insane.. This thing doesn't run on one AI model.. It runs on 19 of them.. At the same time.. It uses Claude Opus for complex reasoning.. Gemini 3.1 Pro for deep research with a 2 million token context window.. Nano Banana Pro for 4K images.. Grok for fast tasks.. It doesn't just pick one model and hope for the best.. It reads your task.. Breaks it into subtasks.. And routes each one to whichever model is best at that specific thing.. All running in parallel.. While ChatGPT is still thinking about your first question.. Perplexity has already split your project into 6 pieces and assigned each one to a different AI.. And here's the part that should worry OpenAI.. Perplexity hallucinates at 3.3%.. ChatGPT hallucinates at 12%.. Claude at 15%.. It's not even close.. Because Perplexity is built differently.. Every other AI tries to remember facts.. Perplexity searches for them first.. It's structurally forced to cite live sources before it's even allowed to generate a response.. OpenAI Operator launched with a 32.6% success rate on computer-use tasks.. People called it "the world's most anxious intern" because it pauses every 5 seconds to ask if it's doing the right thing.. Perplexity runs multi-hour and multi-day workflows independently.. Only interrupts you when it hits a decision that actually matters.. You can start a task from your iPhone on the train.. And it executes on your Mac mini at home.. The economics are wild too.. Internal studies show it saved teams an average of $1.6 million in labor costs.. Performing 3.25 years of work in four weeks.. And unlike every other AI company.. Perplexity dropped ads entirely.. 
They charge $200 a month because they said they're in the "accuracy business".. Not the advertising business.. They even launched a $42.5 million publisher program to pay media partners when their content gets cited.. While OpenAI is getting sued by every newspaper on earth.. Google and OpenAI want you locked into their ecosystem.. If a better model comes out tomorrow you're stuck.. Perplexity just updates its routing matrix.. You get the best model on earth automatically.. No switching.. No migrations.. No friction.. This isn't an AI assistant anymore.. This is the first real AI employee.. And it costs $200 a month.

Evan Luthra

1,096,604 views • 2 months ago

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis: AI equals memory (save this). Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute: GPUs stay near 100% utilization crunching through billions of gradient updates. Inference is a completely different problem. When a model generates a response, it produces tokens one at a time, and at every single step the entire model has to be read from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive. This is what engineers mean when they say inference is memory bound: the bottleneck is not how many calculations you can do per second but how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem; it just gives you more processors starving for the same data. Modern LLMs use a KV cache, a data structure that stores the conversation's context so the model does not have to recompute it from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token, and for long documents or deep reasoning chains it can dwarf the model weights themselves in memory consumption. This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds and how cheaply you can run it. A memory-constrained model is not just slower but qualitatively worse: it forgets earlier parts of the conversation, truncates context and hallucinates more because it literally cannot hold the relevant information long enough to use it. The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics (cost per token, latency, context length, concurrent users) are memory problems first and compute problems second. The companies that control memory bandwidth and supply are not suppliers to the AI trade; they are the AI trade. Long Micron! Follow me, Melvin, for more AI, semis and the next big market themes.
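
The memory-bound argument in this post can be put into rough numbers. The sketch below is a back-of-envelope estimate, not a benchmark: it assumes a hypothetical dense 70B-parameter model in 16-bit precision, a hypothetical layer and head layout, and a hypothetical 3,000 GB/s of memory bandwidth. The function names are made up for illustration. It uses the standard KV cache size formula (one key and one value vector per layer, per KV head, per token) and treats decode speed as bandwidth divided by bytes read per step.

```python
# Back-of-envelope model of memory-bound decoding.
# Every number passed in below is an illustrative assumption, not a measurement.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

def decode_tokens_per_second(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Upper bound on single-stream decode speed if each step reads all weights plus the cache."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)

GB = 1e9

# Hypothetical dense 70B-parameter model stored at 2 bytes per parameter.
weights = 70e9 * 2
# Hypothetical layout: 80 layers, 8 KV heads of dimension 128, 32,000 tokens of context.
cache = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=32_000)
# Hypothetical accelerator with 3,000 GB/s of memory bandwidth.
bound = decode_tokens_per_second(weights, cache, bandwidth_bytes_per_s=3000 * GB)

print(f"KV cache: {cache / GB:.1f} GB, decode bound: {bound:.1f} tokens/s")
```

Because the cache term scales linearly with both `seq_len` and `batch`, raising either one grows the bytes read per step, which is the mechanism behind the post's claim that memory sets context length and concurrency.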

Melvin

47,148 views • 3 days ago

I am stoked to announce that I won the OpenAI Developers Codex x Mollie Hacka Worldwide Hackathon in Paris. 60+ builders, every one of us working solo, one day to ship. I built mine around a single question: who gets to own intelligence? The default answer is scary. You hand your data to a handful of labs, they train the model, they own it, and you rent back a thin slice of what your own data made possible. That is the bargain on the table today. I do not accept it. So I built Lensemble: a Tapestry-like distributed training platform for JEPA-based world models. What it enables: world models that a community improves together, keeps sovereign, and co-owns. Two bets sit underneath it. First, the paradigm. Language models predict the next token. Powerful for text, a dead end for the physical world. A robot does not need to autocomplete sentences, it needs to predict what happens next in the world. That is what JEPA does: it learns by predicting representations instead of pixels or tokens. I am convinced world models are the most underrated paradigm in AI right now, and the closest thing we have to a ChatGPT moment for robotics. Second, the politics. Your raw trajectories never leave your machine. Each participant trains locally against a shared protocol and ships only an update, never the data. A federated round folds those updates into one shared world model, a LeWorldModel-based model, and the gain is measured, not claimed: a 12k-parameter adapter on a frozen backbone, held-out prediction error down about 12 percent, the model measurably less surprised by the world. Then the upside is split by contribution weight, so the people who improved the model own a share of what it earns. This is the thesis behind Project Tapestry, the AI Alliance and Yann LeCun's push for federated, sovereign frontier AI, carried into world models and robotics. Call it Tapestry for the physical world. All of it built solo, in a single day, with Codex as my pair the whole way. 
Thank you to OpenAI Codex and Mollie for backing builders who ship real things, and to Boris and the organizing crew for the room and the standard you set. Intelligence the world improves, and the world owns. That is the future I want for my kids, and the one I will keep building.
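
The federated round described in this post (local training, updates shipped instead of data, one merged model, upside split by contribution) can be sketched in a few lines. This is a generic FedAvg-style illustration and not Lensemble's actual protocol: the function name `federated_round`, the use of local sample counts as the contribution weight, and the toy two-parameter model are all assumptions made for the example.

```python
# Sketch of one federated round: clients send parameter updates, never raw data.
# Weighting by local sample count (FedAvg-style) is an assumption, not the post's stated method.

def federated_round(global_params, client_updates):
    """Merge client deltas into the global parameters.

    client_updates is a list of (delta, num_samples) pairs, where delta has
    the same length as global_params.
    Returns the merged parameters and each client's contribution share.
    """
    total = sum(n for _, n in client_updates)
    merged = list(global_params)
    for delta, n in client_updates:
        weight = n / total
        for i, d in enumerate(delta):
            merged[i] += weight * d
    shares = [n / total for _, n in client_updates]
    return merged, shares

# Toy example: a two-parameter adapter and two participants.
params = [0.0, 1.0]
updates = [([1.0, -1.0], 300), ([3.0, 1.0], 100)]
new_params, shares = federated_round(params, updates)
print(new_params, shares)
```

A real system would likely weight contributions by measured improvement on held-out data, as the post implies, instead of by raw sample count; sample count is used here only because it keeps the sketch self-contained.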

abdel

16,727 views • 10 days ago

This Chinese developer linked two $2,999 NVIDIA DGX Sparks into one box and runs the full Qwen3-235B at home, after dropping his $1,999-a-month cloud bill to zero. He wired 2 small boxes into a single computer, split a giant 235-billion-parameter model in half between them, and serves it across his own network at about 10 tokens a second, with no internet, no cloud, right there on the desk. No data center, no thousand-dollar graphics cards, no monthly cloud bill. Just him, 2 gold boxes the size of a sandwich, one cable between them, and 1 power strip. And here is the whole payoff. He used to pay the cloud $1,999 a month for the same model, and the meter ticked on every request. Now he paid $5,998 once for 2 boxes, they covered their cost in 3 months, and after that he sends as many requests as he wants for free, only electricity. The two Sparks talk over one fast cable, each holds 128GB of memory, and together they carry the whole model, about 73GB loaded per box, with the chip inside pinned near the limit at 96%. Both boxes work as one and keep trading data over the cable, with no cloud in the loop and no single word leaking out. The ready model sits on one local address, and any app on his network calls it as easily as ChatGPT. And here is how he described, in plain words, what this pair of boxes does: "this is a pair of boxes that holds the huge Qwen3-235B model and serves it to one network. the model is split in half, and each box owns its half. 
parts:
// Box 1 (holds the first half of the model and starts the answer fast, the first word appears in under a second)
// Box 2 (holds the second half and writes out the rest, about 10 tokens a second)
// Cable (connects the 2 boxes and moves data between them on every step, with no lag)
// Address (one local address where any app sends its request, like to a cloud model)
// Test (a script that runs big prompts through and measures speed and delays)
// Monitor (checks temperature, power draw, and load on both boxes every 2 seconds).
the model never goes to the cloud. he only steps in when a box runs hotter than 80 degrees or the cable between them starts dropping data."

So the system knows exactly what it is, what it is for, and where its limits are. It knows it has to hold the whole huge model across 2 boxes on its own. It knows it has to answer every request locally, with no meter, no limits, and no internet. It knows the human is only needed when a box overheats or the link between them stalls.

→ The setup runs around the clock on 2 boxes, each pulling under 60 watts
→ However many requests he sends, the monthly bill is $0, only electricity
→ The first box starts the answer in under a second
→ The second writes text at about 10 tokens a second
→ One request at a time: 838 tokens in 85 seconds, first word in 0.8s
→ Two requests at once: 697 tokens in 108 seconds, first word in 0.7s
→ Both boxes sit at 96% load and warm up to 76-78 degrees

And only when a chip in a box runs hotter than 80 degrees or the cable between the 2 Sparks drops data does the system call the owner. And when he himself is out on a run or in a coffee shop, he still reaches his own model at home from his phone: sends a big prompt to the local Qwen3-235B, gets the full answer back in under a minute and a half, with no token meter ticking and no limit to hit. 
Here is what the test shows on his screen during one of the night runs:

"one request at a time: 838 tokens in 84.9 seconds, first word in 0.8s, then 0.1s per token."
"two requests at once: 697 tokens in 107.6 seconds, first word in 0.7s, then 0.15s per token."
"Box 1: chip at 96% load, 76 degrees, 56 watts, 73GB used in memory."
"Box 2: chip at 96% load, 78 degrees, 56 watts, the Qwen3-235B model fully loaded."

And while everyone around is paying for AI by the month and bumping into limits, his top-tier model just sits on the desk and works as much as he wants: his own little power plant instead of a forever meter. He has no server rack of his own and no cloud account behind it. Just 2 DGX Spark boxes on a desk, one model split in half between them, one local address, and a folder of prompts next to it. Out of everything I have seen this year, this is the cleanest way to stop paying for AI: $5,998 of hardware on the desk once, $0 a month to the cloud, unlimited forever, and between them 2 gold boxes, 1 cable, and the full Qwen3-235B answering at home with no internet.
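
The payback and speed figures in this post can be checked with simple arithmetic. The sketch below uses only numbers quoted in the post and ignores electricity cost. The helper `decode_rate` is a made-up name, and its definition (tokens after the first, divided by time after the first token) is one assumed way to read "then 0.1s per token". The post does not say whether the 697 tokens in the two-request run are per request or in total, so that rate is reported as quoted without interpretation.

```python
# Arithmetic on the figures quoted in the post; nothing here is a new measurement.

hardware_cost = 2 * 2999        # two DGX Spark boxes, USD
cloud_bill_per_month = 1999     # previous cloud spend, USD per month
payback_months = hardware_cost / cloud_bill_per_month

def decode_rate(tokens, total_seconds, first_token_seconds):
    """Tokens per second once the first token has appeared (assumed definition)."""
    return (tokens - 1) / (total_seconds - first_token_seconds)

single = decode_rate(838, 84.9, 0.8)    # "one request at a time" run
dual = decode_rate(697, 107.6, 0.7)     # "two requests at once" run

print(f"payback: {payback_months:.2f} months")
print(f"one request: {single:.2f} tokens/s ({1 / single:.2f} s per token)")
print(f"two requests: {dual:.2f} tokens/s ({1 / dual:.2f} s per token)")
```

Both results line up with the per-token latencies shown in the test output (0.1s and 0.15s), and the payback period matches the stated 3 months before electricity is counted.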

Blaze

93,219 views • 1 month ago