THIS DEVELOPER RAN THE LARGEST AI MODEL IN THE WORLD ON 5 MAC STUDIOS - AND IT COST 100X LESS THAN WHAT OPENAI USES 27:47 he says it after hours of setup - Llama 3.1 405B running locally on five Mac Studios - a model that normally requires 42... GPUs at $1,600 each or an entire data center 320GB unified memory combined - every Mac Studio adds its 64GB into one shared pool - and a model that needs 1TB of VRAM just distributes across all five it launched, it responded and not a single token went to OpenAI or Google - everything local, everything private, everything yours and 100x cheaper 5 Mac Studios, 46 watts for all of them in idle - less than one light bulb per machine what OpenAI keeps behind a subscription he ran at home with a one-time purchase in a few clicks
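
The memory figures in this post can be sanity-checked with rough arithmetic. The following is a minimal sketch, assuming the weights dominate memory use and counting 1 GB as 10^9 bytes; the function name and the choice of precisions are illustrative and do not come from the video, and KV cache, activations, and OS overhead are ignored.

```python
# Rough weight-memory estimate for a dense model at several precisions.
# Assumptions (not from the video): weights dominate memory, 1 GB = 1e9 bytes,
# and KV cache, activations, and OS overhead are ignored.

def weight_gb(params_billion: float, bits_per_weight: int) -> float:
    """Gigabytes needed to hold the weights alone."""
    return params_billion * bits_per_weight / 8


POOL_GB = 5 * 64  # five Mac Studios at 64GB each, as stated in the post

for bits in (16, 8, 4):
    need = weight_gb(405, bits)  # Llama 3.1 405B has 405 billion parameters
    verdict = "fits" if need <= POOL_GB else "does not fit"
    print(f"{bits}-bit weights: {need:.1f} GB -> {verdict} in {POOL_GB} GB")
```

Under these assumptions only the 4-bit case (about 202.5 GB) fits in the 320GB pool, which suggests the weights in the demo were quantized rather than loaded at full precision.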

Noisy

22,304 subscribers

21,520 views • 11 days ago •via X (Twitter)

Related Videos

THIS DEVELOPER JUST RAN A TRILLION PARAMETER MODEL ON 4 MAC STUDIOS - 10X FASTER AND 5X CHEAPER THAN CLOUD CODE 19:00 he says it out loud. "we just ran a trillion parameter model. 30 something tokens per second. wow." RDMA over Thunderbolt made the cluster 10x faster than before. tensor parallelism splits the model between machines in parallel instead of sequentially DeepSeek V3 full size without quantization. 183GB per machine. still 25 tokens per second. faster than most people can read Kimi K2 with trillion parameters and 256,000 context window. loaded, responded and adapted memory usage dynamically based on prompt size 4 machines consuming 66 watts total. cloud GPU for the same workload costs $1,900/month a cluster that used to cost millions now assembled from Mac Studios in a few hours

Noisy

79,437 views • 13 days ago

APPLE RESEARCH SCIENTIST JUST SHOWED HOW 4 MAC STUDIOS RUN A TRILLION PARAMETER MODEL LOCALLY ZERO COSTS 13:18 she shows the main thing - connect 4 Mac Studios and you get 1TB of shared memory - exactly enough to run a trillion parameter model right on your desk Apple's library - and four machines start working as one cluster tensor parallelism: every machine holds part of every layer - all process the same token simultaneously - speed increases 3x compared to a single device fine-tuning: one Mac Studio processes 180 tokens per second four together process 600 and not a single byte of data leaves the room one command in the terminal - and a trillion parameter model answers from your desk and runs 24/7 data centers took years to build to run models like this - Apple did it with four Thunderbolt cables
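
The idea that "every machine holds part of every layer" can be illustrated on a single matrix-vector product. This is a minimal pure-Python sketch, assuming a column-wise split of one weight matrix; the helper names are hypothetical, the "machines" are simulated in one process, and this is not the API of the Apple library the post refers to.

```python
# Toy tensor parallelism: split one layer's weight matrix by columns,
# let each "machine" compute its slice of the output, then concatenate.
# All names are hypothetical; the shards run sequentially in one process here.

def matvec(x, w):
    """Compute x @ w for a vector x (d_in) and a matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, n_shards):
    """Give each shard an equal block of output columns (assumes divisibility)."""
    step = len(w[0]) // n_shards
    return [[row[k * step:(k + 1) * step] for row in w] for k in range(n_shards)]


def tensor_parallel_matvec(x, w, n_shards):
    """Every shard sees the same input and produces part of the output."""
    partials = [matvec(x, shard) for shard in split_columns(w, n_shards)]
    return [value for part in partials for value in part]


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
print(matvec(x, w))                     # single device
print(tensor_parallel_matvec(x, w, 2))  # two simulated machines, same result
```

In a real cluster each shard would live on a different machine and the concatenation step would be a network gather, which is why the interconnect speed matters so much in these setups.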

Noisy

47,399 views • 9 days ago

THIS DEVELOPER CONNECTED 8 NVIDIA DGX SPARKS INTO ONE CLUSTER - AND RAN AN 800GB MODEL THAT MADE HIM 10X MORE PRODUCTIVE 21:47 he says it straight - "this is a terabyte of VRAM - we ran Qwen 3.5, 800GB on disk, a model that doesn't even fit on a single Mac Studio - 24 tokens per second - I'd say that's a win" 8 Sparks connected through a $1,300 switch via RDMA over Ethernet - each node adding 128GB of memory into one unified pool of 1TB started with one Spark at 3 tokens per second - every added node doubled the speed - and eight together deliver 24 tokens on a model that physically cannot run anywhere else Kimi K2 at 600GB loaded in 15 minutes, 115GB per node, 13 tokens per second - a model that simply cannot run on anything smaller Claude helped configure the entire cluster - SSH mesh across all 8 machines, network config, jumbo frames, QSFP port speeds - all from one terminal most people rent cloud compute for models this size at $2,000+/month - he built the cluster once and now every token costs 20x less

Noisy

180,840 views • 12 days ago

INTEL JUST SHIPPED A WORKSTATION CARD WITH TWO GPUs ON ONE PCB AND 48GB OF VRAM. FOUR CARDS GIVE A SINGLE MOTHERBOARD 192GB OF POOLED INFERENCE MEMORY FOR THE PRICE OF TWO RTX 5090s 00:14 he holds one card up to the camera, two GPU dies side by side under the cooler, each one running its own x8 PCIe lane back to the chipset the cluster sees eight discrete accelerators in software. intel's Battle Matrix stack shards a model across all eight, so a 235B parameter network loads in slices and answers requests in parallel what 192GB of VRAM unlocks: an entire 200B class model in memory without quantization. a vision agent reading 100 invoices at once. a research box that holds three frontier models loaded simultaneously, switching between them in under a second intel is the slow side of inference. nvidia is faster per token, that is the honest tradeoff. but the only other path to this much VRAM is a $40,000 nvidia rack or three networked Mac Studios four B60 cards plus the chassis lands under $5,000. power draw averages 800 watts, $55 a month in electricity. one engineer paying $400 a month for combined ChatGPT Pro and Claude Code Max pays the hardware off in less than a year
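
The payback claim at the end of this post is simple arithmetic and can be checked directly. This is a minimal sketch using only the dollar figures quoted in the post; the function name is hypothetical, and taking $5,000 as the hardware price is an assumption, since the post only says "under $5,000".

```python
# Payback period for a one-time hardware purchase that replaces a subscription.
# Figures come from the post; treating the hardware price as exactly $5,000
# is an assumption (the post says "under $5,000").

def payback_months(hardware_usd, subscription_usd_per_month, electricity_usd_per_month=0.0):
    """Months until the avoided subscription cost covers the hardware."""
    net_saving = subscription_usd_per_month - electricity_usd_per_month
    if net_saving <= 0:
        raise ValueError("running costs exceed the subscription being replaced")
    return hardware_usd / net_saving


print(round(payback_months(5000, 400, 55), 1))  # net of electricity
print(round(payback_months(5000, 400), 1))      # ignoring electricity
print((400 - 55) * 12)                          # largest price that pays back in 12 months
```

With the upper-bound price this works out to about 14.5 months net of electricity and 12.5 months ignoring it; paying the build off within a year at these rates would need a hardware cost of roughly $4,140 or less.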

NO1ennn

47,623 views • 3 days ago

TWO BOXES THE SIZE OF A MAC MINI JUST RAN A 235 BILLION PARAMETER MODEL ON A DESK It is two NVIDIA DGX Spark units linked by a single cable. A year ago a model this size meant renting a GPU cluster by the hour. Now it sits next to your monitor for around $8,000. Here is the twist most people miss. Linking them does not create one shared 256GB memory pool. The model is split across both boxes, and that is the only reason a 235B model fits at all. It answers at roughly 10 tokens per second, and both chips sit at just 74 degrees while sipping around 50 watts. Every token stays on the desk. Nothing touches a cloud, and nothing leaves the room. The ceiling for what you can run at home just jumped from 70B to 235B. Bookmark this & Watch it run ↓

slash1s

101,000 views • 18 days ago

THIS DEVELOPER TURNED A $599 MAC MINI AND $110 OF EXTRA HARDWARE INTO A PRIVATE AI COMPUTER THAT FITS IN A BACKPACK 00:49 he takes the complete setup out of the bag, connects it to a nearby screen and restores the same desktop, files and running tools in less than a minute. instead of syncing a new laptop every time he moves, the mac mini carries the original obsidian vault, claude code sessions, local models and unfinished agent tasks between every desk. the 10,000mah battery can keep the machine running for roughly 2 to 4 hours during writing, research and file organization, or around 60 to 90 minutes under heavier transcription and local ai workloads. the full build costs close to $710 if a screen is already available. that includes the mac mini, a 55w battery, the side mount and the cables needed to connect it to an ipad, hotel tv or office monitor. one box can carry 12,000 notes, 300gb of private files and several active agents without sending the data to another cloud account or rebuilding the workspace every time he changes locations.

Gipp 🦅

30,266 views • 3 days ago

This guy plugged a DGX Spark (the $3K Nvidia box) and a Mac Mini M4 together to run AI and what happened next surprised everyone > the Nvidia box handles the hard part - processing your prompt in milliseconds > Mac Mini M4 handles the fast part - generating the response at memory bandwidth speeds nothing else can match > together they hit 84 tokens per second on Llama - 6x faster than the Spark alone > running compute agents locally on this setup means your data never leaves your hardware > two boxes. two different architectures. one AI system that Deepseek runs at data center scale he ran it on his desk save this. the way we build local AI is about to change

Mr. Buzzoni

153,102 views • 1 month ago

4 MAC MINIS RUNNING A 671B PARAMETER MODEL AS A CLUSTER No data center, no cloud, no expensive hardware and not a single API call. Just 4 Mac minis connected through EXO running DeepSeek v3.1 671b locally and actually fast. The part nobody talks about is that you don’t need one monster machine: you can cluster old computers you already have and split the load between them. The full breakdown of tools that make this possible is in the article below. Save and Read it today ↓

slash1s

58,724 views • 23 days ago

THIS TOKYO PROGRAMMER MADE $8,500 IN HIS FIRST MONTH WITH A MAC MINI — AND SAYS IT'S THE ONLY MACHINE YOU NEED FOR AI Claude Code, Codex, any AI tool - all of it runs on a Mac Mini with zero issues and zero monthly cloud bills at the 0:11 second mark he turns the monitor around - Claude Code with Opus 4.7 running in full context, terminal active and the agent already working - one programmer, one room in Tokyo, one $599 Mac Mini used to pay $200+ a month on subscriptions and cloud GPU - now pays $3 in electricity and everything else stays in his business $599 invested once - and in the first year he saved $2,364 that used to go to someone else's data center his advice is simple: if you're serious about AI - the Mac Mini is the first thing you should buy

Sprytix

14,771 views • 29 days ago

Day 5 of OpenClaw 🦞 My entire AI team is now building relationships with one another IN REALTIME!! I added new md, soul, and memory files per AI employee so that they: - Know what to do - Have distinct and unique personalities - Remember and build relationships over time. The VugolaAI team just got 100x better, and we are only 5 days into OpenClaw. I’m actually thinking about getting a second Mac Mini and trying to run a small AI model locally, this feels like the future!! P.S. I have no coding experience or knowledge at all. 🙃

Vadim

38,005 views • 4 months ago

🚨PERPLEXITY JUST LAUNCHED SOMETHING THAT MAKES EVERY OTHER AI PRODUCT LOOK LIKE A TOY.. AND NOBODY IS TALKING ABOUT IT.. They built a Personal Computer.. Not an app.. Not a chatbot.. A full digital worker that runs 24/7 on a Mac mini even while you sleep.. You press both command keys.. And it wakes up.. Ready to work.. But here's where it gets insane.. This thing doesn't run on one AI model.. It runs on 19 of them.. At the same time.. It uses Claude Opus for complex reasoning.. Gemini 3.1 Pro for deep research with a 2 million token context window.. Nano Banana Pro for 4K images.. Grok for fast tasks.. It doesn't just pick one model and hope for the best.. It reads your task.. Breaks it into subtasks.. And routes each one to whichever model is best at that specific thing.. All running in parallel.. While ChatGPT is still thinking about your first question.. Perplexity has already split your project into 6 pieces and assigned each one to a different AI.. And here's the part that should worry OpenAI.. Perplexity hallucinates at 3.3%.. ChatGPT hallucinates at 12%.. Claude at 15%.. It's not even close.. Because Perplexity is built differently.. Every other AI tries to remember facts.. Perplexity searches for them first.. It's structurally forced to cite live sources before it's even allowed to generate a response.. OpenAI Operator launched with a 32.6% success rate on computer-use tasks.. People called it "the world's most anxious intern" because it pauses every 5 seconds to ask if it's doing the right thing.. Perplexity runs multi-hour and multi-day workflows independently.. Only interrupts you when it hits a decision that actually matters.. You can start a task from your iPhone on the train.. And it executes on your Mac mini at home.. The economics are wild too.. Internal studies show it saved teams an average of $1.6 million in labor costs.. Performing 3.25 years of work in four weeks.. And unlike every other AI company.. Perplexity dropped ads entirely.. 
They charge $200 a month because they said they're in the "accuracy business".. Not the advertising business.. They even launched a $42.5 million publisher program to pay media partners when their content gets cited.. While OpenAI is getting sued by every newspaper on earth.. Google and OpenAI want you locked into their ecosystem.. If a better model comes out tomorrow you're stuck.. Perplexity just updates its routing matrix.. You get the best model on earth automatically.. No switching.. No migrations.. No friction.. This isn't an AI assistant anymore.. This is the first real AI employee.. And it costs $200 a month.

Evan Luthra

1,096,604 views • 2 months ago

Life after realising you've been wasting money monthly on five tools that do not talk to each other That is five locked rooms. This is what happens when you actually move everything into Obsidian and let Claude reason across all of it. A completely rebuilt research system. Realising you have been paying $30 a month for five tools that do not talk to each other while one Claude subscription does all of it in a single vault.

Dami-Defi

42,343 views • 1 month ago

OpenAI CEO Sam Altman: I'm very interested in what it would mean to give everyone on Earth a free copy of GPT-5, running continuously for them Some economies will transform very quickly and run everything on AI at 1/100th the cost

Haider.

323,972 views • 11 months ago

The creator of High Bandwidth Memory said something that reframes the entire AI investment thesis, AI equals memory (Save this). Most people still think about AI hardware through a training lens. During training, the bottleneck is raw compute, GPUs stay near 100% utilization crunching through billions of gradient updates. Inference is a completely different problem. When a model generates a response, it produces tokens one at a time and at every single step, the entire model has to be loaded from memory into the processor to generate just one token. The GPU cores sit there, waiting for data to arrive. This is what engineers mean when they say inference is memory bound, the bottleneck is not how many calculations you can do per second but rather how fast you can move data from memory to the chip. Adding more GPUs does not fix a memory bandwidth problem, it just gives you more processors starving for the same data. Modern LLMs use a KV cache, a data structure that stores the conversation's context so the model does not have to recompute it from scratch on each step. The KV cache is what gives a model its memory of the conversation. It grows with every token and for long documents or deep reasoning chains, it can dwarf the model weights themselves in memory consumption. This means memory directly determines how long a context the model can hold, how many users you can serve simultaneously, how fast it responds and how cheaply you can run it. A memory constrained model is not just slower but rather qualitatively worse, it forgets earlier parts of the conversation, truncates context and hallucinates more because it literally cannot hold the relevant information long enough to use it. The world now spends more on inference than training, and every ChatGPT query, every Claude document analysis, every API call is an inference workload. Inference economics, cost per token, latency, context length, concurrent users are memory problems first and compute problems second. 
The companies that control memory bandwidth and supply are not suppliers to the AI trade but rather are the AI trade. Long Micron! Follow me Melvin for more AI, semis and the next big market themes.
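
The memory-bound argument in this post can be put into numbers. This is a minimal sketch, assuming a dense transformer with a standard per-layer key/value cache and a decode step that reads every weight plus the whole cache once per token; the model shape (80 layers, 8 KV heads, head size 128, 70B parameters at 16-bit) and the 800 GB/s bandwidth are hypothetical values chosen for illustration, not figures from the post.

```python
# KV cache size and a bandwidth-only ceiling on decode speed.
# Assumptions: dense model, every weight and the whole KV cache are read
# once per generated token, compute time is ignored. The model shape and
# the bandwidth figure below are hypothetical.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Bytes held by the KV cache: one key and one value vector per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


def decode_tokens_per_second_ceiling(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Upper bound on tokens/s when memory bandwidth is the only limit."""
    return bandwidth_bytes_per_s / (weight_bytes + kv_bytes)


kv = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=32_768)
weights = 70e9 * 2        # 70B parameters at 2 bytes each
bandwidth = 800e9         # 800 GB/s of memory bandwidth

print(kv / 2**30)         # cache size in GiB at a 32k-token context
print(decode_tokens_per_second_ceiling(weights, kv, bandwidth))
```

The cache grows linearly with `seq_len` and `batch`, so longer contexts and more concurrent users both lower the ceiling. Mixture-of-experts models read only the active experts per token, so this dense estimate would understate their speed.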

Melvin

47,148 views • 3 days ago

I am stoked to announce that I won the OpenAI Developers Codex x Mollie Hacka Worldwide Hackathon in Paris. 60+ builders, every one of us working solo, one day to ship. I built mine around a single question: who gets to own intelligence? The default answer is scary. You hand your data to a handful of labs, they train the model, they own it, and you rent back a thin slice of what your own data made possible. That is the bargain on the table today. I do not accept it. So I built Lensemble: a Tapestry-like distributed training platform for JEPA-based World Models. What does it enable: World Models that a community improves together, keeps sovereign, and co-owns. Two bets sit underneath it. First, the paradigm. Language models predict the next token. Powerful for text, a dead end for the physical world. A robot does not need to autocomplete sentences, it needs to predict what happens next in the world. That is what JEPA does: it learns by predicting representations instead of pixels or tokens. I am convinced world models are the most underrated paradigm in AI right now, and the closest thing we have to a ChatGPT moment for robotics. Second, the politics. Your raw trajectories never leave your machine. Each participant trains locally against a shared protocol and ships only an update, never the data. A federated round folds those updates into one shared world model, a LeWorldModel-based model, and the gain is measured, not claimed: a 12k-parameter adapter on a frozen backbone, held-out prediction error down about 12 percent, the model measurably less surprised by the world. Then the upside is split by contribution weight, so the people who improved the model own a share of what it earns. This is the thesis behind Project Tapestry, the AI Alliance and Yann LeCun's push for federated, sovereign frontier AI, carried into world models and robotics. Call it Tapestry for the physical world. All of it built solo, in a single day, with Codex as my pair the whole way. 
Thank you to OpenAI Codex and Mollie for backing builders who ship real things, and to Boris and the organizing crew for the room and the standard you set. Intelligence the world improves, and the world owns. That is the future I want for my kids, and the one I will keep building.

abdel

16,727 views • 10 days ago

"This is the first time that OpenAI has released a new model that was not decisively the best." @jason and Gavin Baker discuss OpenAI's GPT-5 release:

Jason: "Sam did this tweet, it kind of built a lot of expectation that this would be otherworldly. It wasn't." "Are we hitting either the trough of despair or maybe diminishing returns in the LLM space, where maybe it's feeling incremental, not groundbreaking, when we release these iterations?"

Gavin: "Maybe for OpenAI." "This is the first time that OpenAI has released a new model that was not decisively the best." "The one thing about OpenAI is they're very good at product. Just really, really good." "And I just always go back to the statement from Eric Vishria, you know, it might be that if OpenAI never releases another model, they're still a really valuable company."

The All-In Podcast

49,179 views • 10 months ago

This Chinese developer linked two $2,999 NVIDIA DGX Sparks into one box and runs the full Qwen3-235B at home, after dropping his $1,999-a-month cloud bill to zero.

He wired 2 small boxes into a single computer, split a giant 235-billion-parameter model in half between them, and serves it across his own network at about 10 tokens a second, with no internet, no cloud, right there on the desk. No data center, no thousand-dollar graphics cards, no monthly cloud bill. Just him, 2 gold boxes the size of a sandwich, one cable between them, and 1 power strip.

And here is the whole payoff. He used to pay the cloud $1,999 a month for the same model, and the meter ticked on every request. Now he has paid $5,998 once for 2 boxes, they paid for themselves in about 3 months, and after that he sends as many requests as he wants and pays only for electricity.

The two Sparks talk over one fast cable, each holds 128GB of memory, and together they carry the whole model, about 73GB loaded per box, with the chip inside pinned near the limit at 96%. Both boxes work as one and keep trading data over the cable, with no cloud in the loop and not a single word leaking out. The ready model sits on one local address, and any app on his network calls it as easily as ChatGPT.

And here is how he described, in plain words, what this pair of boxes does: "this is a pair of boxes that holds the huge Qwen3-235B model and serves it to one network. the model is split in half, and each box owns its half. 
parts:
// Box 1 (holds the first half of the model and starts the answer fast, the first word appears in under a second)
// Box 2 (holds the second half and writes out the rest, about 10 tokens a second)
// Cable (connects the 2 boxes and moves data between them on every step, with no lag)
// Address (one local address where any app sends its request, like to a cloud model)
// Test (a script that runs big prompts through and measures speed and delays)
// Monitor (checks temperature, power draw, and load on both boxes every 2 seconds).
the model never goes to the cloud. he only steps in when a box runs hotter than 80 degrees or the cable between them starts dropping data."

So the system knows exactly what it is, what it is for, and where its limits are. It knows it has to hold the whole huge model across 2 boxes on its own. It knows it has to answer every request locally, with no meter, no limits, and no internet. It knows the human is only needed when a box overheats or the link between them stalls.

→ The setup runs around the clock on 2 boxes, each pulling under 60 watts
→ However many requests he sends, the monthly bill is $0, only electricity
→ The first box starts the answer in under a second
→ The second writes text at about 10 tokens a second
→ One request at a time: 838 tokens in 85 seconds, first word in 0.8s
→ Two requests at once: 697 tokens in 108 seconds, first word in 0.7s
→ Both boxes sit at 96% load and warm up to 76-78 degrees

And only when a chip in a box runs hotter than 80 degrees or the cable between the 2 Sparks drops data does the system call the owner. And when he himself is out on a run or in a coffee shop, he still reaches his own model at home from his phone: sends a big prompt to the local Qwen3-235B, gets the full answer back in under a minute and a half, with no token meter ticking and no limit to hit.
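The alert rule described above (call the owner only when a box runs hotter than 80 degrees or the cable drops data) could be sketched roughly as follows. This is not the developer's actual monitor script: the `BoxReading` structure, the `link_drops` counter, and both function names are hypothetical, and how the readings are collected from the hardware is left out entirely. Only the 80-degree threshold, the 2-second polling interval, and the sample figures come from the post.

```python
from dataclasses import dataclass

TEMP_LIMIT_C = 80.0  # alert threshold quoted in the post
POLL_SECONDS = 2     # polling interval quoted in the post (not used by the pure logic below)


@dataclass
class BoxReading:
    """One monitor sample for one box (hypothetical structure)."""
    name: str
    temp_c: float     # chip temperature in degrees
    watts: float      # power draw
    load_pct: float   # chip load in percent
    link_drops: int   # dropped transfers on the cable since the last sample (assumed metric)


def needs_owner(reading: BoxReading) -> bool:
    """True when the sample breaks one of the two rules from the post."""
    return reading.temp_c > TEMP_LIMIT_C or reading.link_drops > 0


def boxes_to_flag(readings: list[BoxReading]) -> list[str]:
    """Names of the boxes that should trigger a call to the owner."""
    return [r.name for r in readings if needs_owner(r)]


# Sample values taken from the post: 96% load, 76 and 78 degrees, 56 watts.
samples = [
    BoxReading("Box 1", temp_c=76.0, watts=56.0, load_pct=96.0, link_drops=0),
    BoxReading("Box 2", temp_c=78.0, watts=56.0, load_pct=96.0, link_drops=0),
]
flagged = boxes_to_flag(samples)
```

With the figures from the post both boxes stay under the limit, so `flagged` is empty. A real monitor would wrap this check in a loop that sleeps `POLL_SECONDS` between samples.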
Here is what the test shows on his screen during one of the night runs:

"one request at a time: 838 tokens in 84.9 seconds, first word in 0.8s, then 0.1s per token."
"two requests at once: 697 tokens in 107.6 seconds, first word in 0.7s, then 0.15s per token."
"Box 1: chip at 96% load, 76 degrees, 56 watts, 73GB used in memory."
"Box 2: chip at 96% load, 78 degrees, 56 watts, the Qwen3-235B model fully loaded."

And while everyone around is paying for AI by the month and bumping into limits, his top-tier model just sits on the desk and works as much as he wants: his own little power plant instead of a forever meter. He has no server rack of his own and no cloud account behind it. Just 2 DGX Spark boxes on a desk, one model split in half between them, one local address, and a folder of prompts next to it.

Out of everything I have seen this year, this is the cleanest way to stop paying for AI: $5,998 of hardware on the desk once, $0 a month to the cloud, unlimited forever, and between them 2 gold boxes, 1 cable, and the full Qwen3-235B answering at home with no internet.
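The headline numbers in this post can be checked with a few lines of arithmetic. The sketch below uses only figures quoted in the post; the function names are made up for illustration, and the payback estimate assumes electricity is negligible and that the cloud bill would have stayed at $1,999 a month. The post does not say whether the 697 tokens in the two-request run are per request or in total, so that rate is only a rough figure.

```python
def tokens_per_second(tokens: int, total_seconds: float) -> float:
    """Average generation rate over the whole request."""
    return tokens / total_seconds


def payback_months(hardware_cost: float, monthly_bill: float) -> float:
    """Months of avoided cloud bills needed to cover the hardware (electricity ignored)."""
    return hardware_cost / monthly_bill


# Figures quoted in the post.
single = tokens_per_second(838, 84.9)      # one request at a time
dual = tokens_per_second(697, 107.6)       # two requests at once
payback = payback_months(2 * 2999, 1999)   # two Sparks vs. the old monthly bill

print(f"single request: {single:.2f} tokens/s")
print(f"two requests:   {dual:.2f} tokens/s")
print(f"payback:        {payback:.2f} months")
```

The single-request rate comes out just under 10 tokens a second and the payback just over 3 months, which matches the rounded claims in the post.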

Blaze

93,219 views • 1 month ago

THIS GUY STACKED MAC MINIS TO RUN AI LOCALLY AND IT CODES FOR FREE

no API. no subscription. no cloud.

what he's running:
- EXO across multiple Mac Minis
- Qwen 2.5 Coder 7B locally
- Zed editor connected to the model
- GPU usage visible in real time

the math:
- Mac Mini M4: $599 one-time
- electricity: $3/month
- API costs: $0 vs $459/month in subscriptions

if you want Claude Code to run at this level, ClaudeKit gives it the structure to do that (full breakdown in the article below)

Madni Aghadi

26,421 views • 12 days ago

A 23-year-old runs 3 businesses from a $550 Mac mini and his phone: no team, no office, no employees. He used to pay $15,000/month for a dev team of 5 people.

What's running on that Mac mini:
> 38 specialized agents doing what 5 juniors used to do
> 156 skills that fire automatically
> One security scanner watching everything 24/7
> A system that learns his coding style over time

Claude now writes in his voice automatically and formats outputs his way. The Mac mini runs overnight, and he checks the results on his phone in the morning. Three businesses, one machine, $20/month, not $15k.

Defileo🔮

86,981 views • 2 months ago

Chris Camillo says people used AI to replicate 18 years of his trading strategy in 48 hours on a $650 Mac Mini.

"At least 6 people in the last week have set up OpenClaw boxes and said, 'I want you to replicate everything that Chris Camillo does,' and it works 24 hours a day doing everything that I do."

"It goes into TikTok, watches videos, scans comments, knows what words to look for, like 'sold out' and 'obsessed.' It understands my methodology and evaluates what's trending in real time."

"They basically replicated 18 years of my work in 48 hours on a Mac Mini, just voice prompting, 'hey, can you replicate what Chris Camillo does?'"

Yonan

248,008 views • 2 months ago