Cloud GPU training is a scam. A single M4 MacBook does 2.9 TFLOPS. Seven friends with MacBooks match an NVIDIA A100.

Alexander Hayes just open-sourced a tool that makes this work over Wi-Fi. It's called AirTrain.

Here's how it works:

Traditional distributed training (DDP) syncs gradients after every single step. For a 124M parameter model, that's ~500MB exchanged per step. You need 50 GB/s of sustained bandwidth. Impossible over Wi-Fi.

AirTrain uses the DiLoCo algorithm. Each Mac trains independently for 500 steps, then syncs only the difference. One sync per 500 steps instead of one per step. 500x less network communication. Wi-Fi actually works. The entire sync takes ~2 seconds.

Here's what makes it wild:

→ Zero-config discovery. Devices find each other automatically via mDNS/Bonjour. Same protocol as AirDrop.
→ Fault tolerant. Nodes can join and leave mid-training without killing the run.
→ Checkpoint relay. Train for a few hours, export a checkpoint, hand it off to someone else to continue. Like a relay race for ML training.
→ Built on Apple's MLX framework. Native to M1/M2/M3/M4/M5 unified memory. No host-to-device copy overhead.
→ Local dashboard. Real-time loss curves, peer monitoring, throughput metrics in your browser.

Here's the wildest part:

An M4 Max with 128GB unified memory can train a 70B parameter model without offloading. An NVIDIA RTX 4090 has 24GB VRAM. Apple Silicon gets ~245-460 GFLOPS per watt. Training on MacBooks costs almost nothing in electricity compared to cloud GPUs. And there are hundreds of millions of Apple Silicon Macs in the world.

The math:

Traditional DDP: 1 sync per step = 50 GB/s required
AirTrain (DiLoCo): 1 sync per 500 steps = 0.1 GB/s required

Wi-Fi handles 0.1 GB/s. That's it. That's the breakthrough.

They even built a community platform with live session browsing, checkpoint sharing, and a contributor leaderboard.

Training a 124M parameter GPT-2? Instead of renting cloud GPUs at $3/hr, pool three MacBooks in a coffee shop and train for free.

MIT licensed. Built in Python. 1 contributor. Early stage but the idea is insane.

100% Open Source. (Link in the comments)
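
The bandwidth math and the "train locally, then sync only the difference" idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not AirTrain's code: the function names are hypothetical, fp32 values (4 bytes per parameter) are assumed for the ~500MB payload, and the rate of 100 steps per second is simply what the post's 50 GB/s figure implies. The DiLoCo paper pairs an AdamW inner optimizer with a Nesterov-momentum outer optimizer; the sketch uses plain SGD for both to stay short.

```python
def required_bandwidth_gb_s(payload_gb, steps_per_second, steps_per_sync):
    """Average bandwidth when one payload is exchanged every `steps_per_sync` steps."""
    return payload_gb * steps_per_second / steps_per_sync


def local_sgd(theta, data, steps, lr):
    """Inner loop: a worker trains alone, with no network traffic."""
    for i in range(steps):
        target = data[i % len(data)]
        theta -= lr * (theta - target)  # gradient of 0.5 * (theta - target) ** 2
    return theta


def diloco_round(global_theta, worker_data, inner_steps, inner_lr, outer_lr=1.0):
    """One outer round: each worker reports only its parameter difference."""
    deltas = [
        global_theta - local_sgd(global_theta, data, inner_steps, inner_lr)
        for data in worker_data
    ]
    avg_delta = sum(deltas) / len(deltas)       # the only thing that is synced
    return global_theta - outer_lr * avg_delta  # simplified outer step (plain SGD)


PARAMS = 124_000_000      # 124M parameter model, from the post
BYTES_PER_PARAM = 4       # assumption: fp32
STEPS_PER_SECOND = 100    # assumption: implied by 50 GB/s / 0.5 GB per step

payload_gb = PARAMS * BYTES_PER_PARAM / 1e9
ddp = required_bandwidth_gb_s(payload_gb, STEPS_PER_SECOND, steps_per_sync=1)
diloco = required_bandwidth_gb_s(payload_gb, STEPS_PER_SECOND, steps_per_sync=500)
print(f"payload per sync: {payload_gb:.3f} GB")
print(f"DDP:    {ddp:.1f} GB/s")
print(f"DiLoCo: {diloco:.4f} GB/s")

# Two toy workers whose local data pull the parameter toward 1.0 and 3.0.
new_theta = diloco_round(0.0, [[1.0], [3.0]], inner_steps=500, inner_lr=0.1)
print(f"global parameter after one round: {new_theta:.3f}")
```

With an outer learning rate of 1.0 the outer step reduces to averaging the workers' parameters, which is why the toy lands midway between the two workers' targets.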

Guri Singh

53,747 subscribers

160,201 views • 3 months ago • via X (Twitter)

Education, Science & Technology

Related videos

"Mac Minis for example are a very good fit" - Andrej Karpathy

Andrej Karpathy shouted out my work on EXO Labs in his keynote at Y Combinator AI SUS! Here's the breakdown:

Right now most AI workloads run in the cloud, where requests from different users are continuously batched together. These workloads are FLOPS-bound and favor hardware with the best unit economics of $ per FLOP, i.e. enterprise GPUs.

The personal computing revolution will shift these workloads to personal devices with lower batch sizes (mostly batch_size=1). batch_size=1 inference is memory-bound, because all of the model parameters need to be loaded into the GPU every time a token is generated.

Apple Silicon with its Unified Memory architecture has a lot of memory and memory bandwidth per $ compared to other hardware:
- M4 Pro Mac Mini, 24GB @ 273GB/s, $58.33/GB, $5.13/GB/s
- H100, 80GB @ 3350GB/s, $625/GB, $14.93/GB/s

The unit economics of Apple Silicon are becoming more compelling with every release of the Mac.

The future of AI inference looks more like open weights models (OpenAI open weights model soon™) run at low batch_size on personal devices.

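The $/GB and $/GB/s figures in the post above can be reproduced with simple division. The sketch below is illustrative only: the post does not quote device prices, so the $1,400 and $50,000 values are assumptions back-derived from its ratios, and the `Device` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    memory_gb: float
    bandwidth_gb_s: float
    price_usd: float  # assumption: inferred from the post's ratios, not a quoted price

    @property
    def usd_per_gb(self) -> float:
        """Dollars per gigabyte of memory."""
        return self.price_usd / self.memory_gb

    @property
    def usd_per_gb_s(self) -> float:
        """Dollars per GB/s of memory bandwidth."""
        return self.price_usd / self.bandwidth_gb_s


devices = [
    Device("M4 Pro Mac Mini", memory_gb=24, bandwidth_gb_s=273, price_usd=1_400),
    Device("H100", memory_gb=80, bandwidth_gb_s=3350, price_usd=50_000),
]

for d in devices:
    print(f"{d.name}: ${d.usd_per_gb:.2f}/GB, ${d.usd_per_gb_s:.2f}/GB/s")
```
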
Alex Cheema

103,566 views • 1 year ago

no money for grok or midjourney? this tool is for you.

there's a FREE tool created by an anon dev. open-source. runs locally. 117k stars on github.

it generates:
> images & video
> 3d models
> audio
> 20+ models

here's how to set it up in under 5 minutes:

1️⃣ download ComfyUI Desktop
grab the desktop app for your system: windows 10+, mac (apple silicon), or linux. it installs like any normal app and sets up python and every dependency for you in the background. no terminal, no config files.

2️⃣ open it
on first launch, it spins up its own environment automatically. you just wait a few seconds and you're in. you'll land on a node canvas, that's the whole interface.

3️⃣ load a starter workflow
top menu → Workflow → Browse Templates → Image Generation. click it. this drops a ready-made setup onto your canvas so you don't build anything from scratch.

4️⃣ grab a model
comfyui ships empty on purpose: the model is the brain, and you pick it. in the template, the "Load Checkpoint" node has a Download button when no model is installed. click it. it pulls one in for you (a few GB, this is the only real wait).

5️⃣ install ComfyUI Manager
this is the one add-on you don't skip. it lets you install models, custom nodes, and updates with a click instead of the command line. grab it from github (link in comments). it's the difference between fighting comfyui and flying in it.

one honest note: an NVIDIA gpu makes this fast, apple silicon works great too, and a weak machine still runs it, just slower.

that's the whole setup. you now own an image, video, and 3D studio that costs you nothing per month.

save this. and the next time grok or midjourney asks for your card, you won't need it.

disclaimer: comfyui itself is 100% free. so are the local models (sdxl, flux, wan 2.2, ltx-2). some premium models like seedance are pay-per-use api models, only if you want top-tier quality. the free local ones cover most of what you need.

(github link in the comments) follow and turn on post notifications for daily AI content.

m0h

14,542 views • 1 month ago

100 MAC MINIS ON ONE METAL SHELF. IN THE CLOUD, EACH OF THOSE RENTS FOR $120 TO $500 A MONTH.

that clip is a homemade rack: wire shelving packed with mac minis, each running macos, wired into one cluster.

why macs and not a normal server? two reasons.

you can only build, sign, and notarize ios and mac apps on apple hardware. cloud providers know it and charge for it.

and apple silicon sips power and runs near-silent, so you can stack a lot of them on a shelf in a room, not a datacenter.

the rent they're escaping:
- a single mac in the cloud runs ~$120 to $500 a month (macstadium up to aws)
- one engineer documented saving $4,000+ a month by self-hosting mac minis instead of cloud ci runners
- a used or new mini is a few hundred dollars once, then a few dollars of electricity

what a shelf like this actually runs: ios and mac build farms, render jobs, app and device testing automation, and increasingly local ai inference. own the boxes, rent nothing.

the uncomfortable part: the cloud was never the only option. it was the convenient one. the people who did the math bought the shelf.

the honest caveat: it's real capex up front, real heat and power, and you become the sysadmin. no rented rack means no one to call at 3am. worth it at scale, overkill for one build a week.

no rack rental, no per-hour metering, no fleet you don't own. save this before your cloud invoice renews again.

RetroChainer

21,055 views • 14 days ago

Everyone wrote Apple off as the AI loser, but one hardware spec might flip that story upside down (Save this).

@jason called Apple a screaming buy on the back of a single chip detail. The rumored M7 Ultra, expected around 2028, is designed to support up to 1.5TB of unified memory, enough to run frontier-class, trillion-parameter AI models locally, with no cloud required.

The Street's bear case on Apple is straightforward. Apple has no frontier model of its own, Siri has stumbled for years, and the company effectively rents OpenAI's models for its hardest queries. That narrative treats Apple as the one Magnificent Seven name that missed the AI wave entirely, but the bull case flips that framing on its head.

If frontier AI models keep shrinking and getting cheaper to run, Apple doesn't need the smartest model in the world; it just needs to own the device that model runs on. And unified memory is the mechanism that makes this possible. Unlike traditional systems where the CPU and GPU each need separate memory, Apple's architecture lets the CPU, GPU and Neural Engine draw from one shared pool.

A fully specced M7 Ultra could theoretically run something on the scale of a 1.2 trillion parameter model locally, and that capability plugs directly into the one advantage Apple has spent over a decade building: privacy. Apple has already shipped Private Cloud Compute, a system designed so even Apple can't access user data processed off device. Apple doubled down on this at WWDC 2026, framing on-device privacy as non-negotiable while rivals default to the cloud.

If the best AI models get small enough to run on Apple silicon, the moat stops being the model and becomes the hardware it has to sit on.

Milk Road Pro remains bullish on Apple and it remains one of our core positions. If you want the full thesis + our full AI trades, come join us using the link below for just $1.
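
Whether a 1.2 trillion parameter model fits in 1.5TB depends on how many bits each weight takes, which the post does not state. A rough sketch, counting weights only (no KV cache or activations) and using a hypothetical helper name:

```python
def weights_terabytes(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in TB (1 TB = 1e12 bytes)."""
    return num_params * bits_per_param / 8 / 1e12


UNIFIED_MEMORY_TB = 1.5   # rumored capacity quoted in the post
NUM_PARAMS = 1.2e12       # model scale quoted in the post

# The precisions below are assumptions; the post does not name one.
for bits in (16, 8, 4):
    size_tb = weights_terabytes(NUM_PARAMS, bits)
    verdict = "fits" if size_tb <= UNIFIED_MEMORY_TB else "does not fit"
    print(f"{bits}-bit weights: {size_tb:.1f} TB -> {verdict}")
```

Under these assumptions the claim holds at 8-bit precision or lower, but not at 16-bit.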

Milk Road AI

37,330 views • 12 days ago

a team of researchers just proved you don't need a bigger model, you need a smarter plan

researchers from Tsinghua and South China University of Technology built a framework called Atomic Task Graph. it turned 7B-8B open-source models into GPT-4 competitors on complex agent benchmarks, beating it on two out of three.

no fine-tuning. no extra training. zero parameter updates.

current AI agents plan in a straight line. step 1, step 2, step 3. when step 4 fails, the whole chain breaks. and the longer the chain gets, the more the model hallucinates because it's reasoning over a ballooning text history.

here's how it works.

1. instead of a linear chain, ATG breaks any complex task into a directed graph where subtask inputs and outputs are explicitly mapped
2. it recursively decomposes each subtask until every node is one atomic tool call
3. independent branches run in parallel instead of waiting in line
4. before anything executes, a lightweight "thought experiment" simulates the plan internally to catch bad dependencies and missing steps early
5. when something breaks at runtime, ATG traces the failure to the exact subgraph that caused it and repairs only that piece. validated work stays frozen.

the old way meant a failure at step 5 forced a full replan from scratch. hallucinated actions piled up the longer the task ran. ReAct hit a 43% hallucination rate on household tasks.

ATG on an 8B Llama model scored 63.65 on ALFWorld. GPT-4 with ReAct scored 41.24 on the same benchmark. hallucinated actions dropped to 12%.

those numbers happened because someone stopped throwing compute at the problem and started thinking about how work gets organized.

that's the part that gets me. the industry is spending billions on scale. this team spent time on architecture. and the architecture won.
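
The graph idea in steps 1 and 3 above can be sketched with Python's standard library. This is not the paper's implementation: `run_task_graph`, the node names, and the lambda "tools" are hypothetical, and the sketch covers only dependency-ordered, parallel execution (plus a cycle check at `prepare()`), not the simulation or repair steps.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_task_graph(dependencies, tools):
    """Run each node once all of its prerequisites have finished.

    dependencies: node -> set of prerequisite nodes
    tools:        node -> callable taking the results dict, returning a value
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    results, waves = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorter.get_ready()   # nodes whose inputs are all available
            waves.append(sorted(ready))
            # Independent nodes in the same wave run in parallel.
            futures = {n: pool.submit(tools[n], dict(results)) for n in ready}
            for node, future in futures.items():
                results[node] = future.result()
                sorter.done(node)
    return results, waves


# Hypothetical household task: two independent lookups, then two dependent steps.
dependencies = {
    "find_mug": set(),
    "find_coffee": set(),
    "brew": {"find_mug", "find_coffee"},
    "serve": {"brew"},
}
tools = {
    "find_mug": lambda r: "mug",
    "find_coffee": lambda r: "coffee",
    "brew": lambda r: f"{r['find_coffee']} in {r['find_mug']}",
    "serve": lambda r: f"served {r['brew']}",
}

results, waves = run_task_graph(dependencies, tools)
print(waves)
print(results["serve"])
```

Because every node reads its inputs from an explicit results dict, a failed node could in principle be re-run without touching the finished ones, which is the property the repair step relies on.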

Alex Veremeyenko

173,206 views • 20 days ago

Alibaba just released a coding model that hits 82 percent on SWE-Bench Verified. That is the highest score ever published for an open-source model. The weights are free. The license is Apache 2.0. You can run it today.

The model is Qwen 4 Coder 32B.

Here is what 82 percent on SWE-Bench Verified actually means.

SWE-Bench Verified tests whether an AI can autonomously resolve real bugs pulled from real production GitHub repositories. Not synthetic exercises. Real open-source projects that real teams depend on. A model gets a bug report, reads the code, writes a fix, and either passes the test suite or it does not.

At 82 percent, Qwen 4 Coder 32B resolves 82 out of every 100 real production bugs it is given. Without a human guiding it. On code it has never seen before.

For comparison:

Qwen 4 Coder 32B: 82 percent SWE-Bench Verified. Open source. Apache 2.0.
Claude Fable 5: 80.3 percent SWE-Bench Pro. $10 input / $50 output per million tokens. Currently suspended.
GPT-5.6 Sol: Competitive on Terminal-Bench. $5 input / $30 output per million tokens.

An open-weight model that you can download and run for free just beat both of them on the benchmark designed to measure real software engineering capability.

Here is the architecture.

Qwen 4 Coder 32B is a 32 billion parameter dense model. Not a Mixture-of-Experts. Every parameter is active on every request. This matters for inference: a dense 32B model runs on 22 gigabytes of VRAM, which fits on a single high-end consumer GPU or a MacBook Pro with 64GB of unified memory.

The smaller variant, Qwen 4 Coder 4B, runs at approximately 135 tokens per second on an M5 Max and fits inside 8 gigabytes of RAM. For a model with usable coding capability, that is a new bar for what fits in a single laptop and for what a sub-8GB model can do on Apple Silicon.

The training methodology continued Alibaba's approach of reinforcement learning on verifiable coding tasks. The model gets rewarded when its code passes tests. It gets penalized when it fails. Over millions of training steps, the model learns to write code that actually runs rather than code that looks plausible.

License: Apache 2.0. Full commercial use. Redistribution must keep the license and copyright notices. No revenue threshold. No monthly active user ceiling.
Weights: Hugging Face, available today.
Runs on: vLLM, Ollama, SGLang, and any standard GGUF-compatible inference engine.

The open-source coding model just beat the best closed-source model in the world on the benchmark designed to test whether AI can actually do software engineering. The weights are free. The subscription is optional.

Source: Autom8Labs AI Insight July 2026, State of Open Source LLMs June 2026, Kunal Ganglani blog June 2026.
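
The "rewarded when tests pass, penalized when they fail" signal can be illustrated with a tiny reward function. This is a minimal sketch, not Alibaba's training code: the function name, the +1/-1 reward values, and the test format are all assumptions, and a real pipeline would run candidates in a sandbox rather than calling `exec` directly.

```python
def verifiable_reward(candidate_source, func_name, tests):
    """Return +1.0 if the candidate passes every test, otherwise -1.0.

    tests: list of (args_tuple, expected_value) pairs.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)  # unsafe outside a sandbox
        func = namespace[func_name]
        for args, expected in tests:
            if func(*args) != expected:
                return -1.0                # wrong answer
    except Exception:
        return -1.0                        # syntax error, crash, or missing function
    return 1.0


tests = [((1, 2), 3), ((0, 0), 0), ((-4, 4), 0)]

passing = "def add(a, b):\n    return a + b\n"
plausible_but_wrong = "def add(a, b):\n    return a - b\n"
does_not_parse = "def add(a, b) return a + b"

for source in (passing, plausible_but_wrong, does_not_parse):
    print(verifiable_reward(source, "add", tests))
```

The second candidate looks like reasonable code but fails the tests, which is exactly the case this kind of reward is meant to penalize.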

Harman

41,179 views • 23 days ago

The bottleneck in AI has quietly shifted.

- It's not the models. They are capable.
- It's not the frameworks. They are mature.
- It's not even the data, in many cases.

When you want to train a model today, the first question isn't "what architecture should I use?" Instead, it's: "Where am I going to get infrastructure that actually works?"

Not just GPUs but the entire stack: compute, deployment, scaling, storage.

The traditional path is major cloud providers or specialized GPU clouds. Both have the same problem: they're built for enterprises with committed workloads, minimum spend requirements, and contract negotiations, and they involve quota approvals that take days.

Even the "on-demand" options require you to piece together training, deployment, and scaling across different services. By the time you're actually training, hours, if not days, have passed.

And there's a subtler cost: part of your brain is always managing infrastructure instead of thinking about the actual problem.

I've been using Runpod for a while now, and it's the closest I've found to infrastructure that just disappears. I pay for the serverless solution by the second, and stop when I'm done. This sounds like it should be the default across all providers, but it isn't.

For instance, when I'm prototyping, I don't need an H100. Instead, I need the flexibility to use cheaper GPUs that are actually available, where I can iterate fast and not worry about cost. An A40 at a few cents per hour is perfect for this. Then, when the approach is validated, I scale up. This matches how good engineering actually works.

Running distributed training across multiple nodes for multi-GPU training usually requires significant infra work. Runpod abstracts most of this away.

A lot of the advantage in AI comes from iteration speed. Infra that adds days of latency to that loop is a real cost, even if it's hard to measure. But good infra gets out of your way. It's available when you need it, invisible when you don't.

In the video below, I have shown a simple model training workflow using PyTorch in Jupyter Lab. It runs in a dedicated PyTorch Pod hosted on Runpod, and I worked with the team to put this together for you.

Find a link to start using Runpod in the replies!

Avi Chawla

13,696 views • 6 months ago

NVIDIA JUST SHOWED HOW A $4,699 BOX CAN KILL A $200 MONTHLY AI BILL AND RUN 200B MODELS FROM YOUR DESK.

NVIDIA delivers the DGX Spark to developers, researchers, and AI teams, turning what used to require rented datacenter GPUs into a box small enough to hold in one hand.

inside is 128GB of unified memory, 4TB of storage, and up to 1 PFLOP of AI compute, enough to run 70B models comfortably and push into the 200B class locally.

that means no per-token API bill, no files leaving your machine, and no waiting for someone else's server every time an agent reads documents, writes code, or runs an automation.

two DGX Sparks can even be linked together for models around 400B parameters, while a single unit costs roughly $4,000 to $4,700 instead of renting multi-GPU cloud infrastructure every month.

the full local AI ladder starts at $0 with the computer you already own, but NVIDIA just showed what the final rung looks like: a personal AI supercomputer sitting directly on your desk.
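
How long the box takes to pay for itself follows from the two prices in the post. A quick sketch, ignoring electricity and resale value (both assumptions), with a hypothetical helper name:

```python
import math


def break_even_months(hardware_usd: float, monthly_bill_usd: float) -> int:
    """Whole months of avoided bills needed to cover the hardware price."""
    return math.ceil(hardware_usd / monthly_bill_usd)


MONTHLY_BILL_USD = 200  # the monthly AI bill quoted in the post

for price in (4_000, 4_699):  # the price range quoted in the post
    months = break_even_months(price, MONTHLY_BILL_USD)
    print(f"${price:,} box vs ${MONTHLY_BILL_USD}/month: break-even after {months} months")
```
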

Gipp 🦅

11,334 views • 12 days ago

I'm running Llama 4 Maverick at 620 t/s! I'm living in the future!

Honestly, a large language model running this fast is something straight out of a sci-fi movie. Speeds like this will enable a whole new world of applications that aren't possible today.

For reference, GPT-4o, which is probably the most popular OpenAI model, runs between 60 and 110 t/s.

The secret here: I'm not running Meta's Llama 4 Maverick on a GPU. I'm using the SambaNova Cloud (my sponsor) and their custom SN40L chips. They are optimized from the ground up for running AI workflows.

Right now, SambaNova Cloud runs DeepSeek, Qwen, Whisper, and the entire family of Llama models on these chips.

You can check the speed of each of these models using SambaNova Cloud's Playground (see the attached video). It's completely free, and that's how I'm measuring their speeds.

For example, I also tried DeepSeek R1 (the latest version from May) and, oh boy! DeepSeek R1 is a huge 671B parameter model. It's probably the best open reasoning model in the world, and it runs at 140 tokens per second!

Inference time on an SN40L is night and day from what you'll get from a GPU.

Here is why this is big:

If you are running an agentic workflow that uses multiple models simultaneously on a GPU, it will need to swap models in and out of memory (because not every model fits). A single SN40L chip can simultaneously hold over 100 models (trillions of parameters) in memory.

If you are using open models, try the SambaCloud API to see what lightning speed looks like. Here is how:

1. Create a free account at:
2. Check the QuickStart guide:

If you try the playground, check the speed you're getting with Llama 4 and DeepSeek, and post the results below. I've seen much higher numbers than I posted here, so I'm curious to see whether geography affects the speed.

Santiago

34,148 views • 1 year ago

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
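
To make the "memory streaming problem" framing concrete, here is a rough back-of-the-envelope sketch. It assumes every generated token has to read all model weights from HBM once, ignores the KV cache and activations, and uses 5.3 TB/s as the peak HBM bandwidth of one MI300X (AMD's published spec, not a number from the post). The function name and the perfect-sharding assumption for 8 GPUs are illustrative only, not Kog's method.

```python
def decode_ceiling_tokens_per_s(params_billion: float,
                                bytes_per_param: float,
                                bandwidth_tb_per_s: float) -> float:
    """Upper bound on batch-1 decode speed if each token streams all weights once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes_per_s / weight_bytes


# A 2B-parameter model in FP16 (2 bytes per parameter) is 4 GB of weights.
MI300X_TB_PER_S = 5.3  # assumed peak HBM bandwidth per GPU

one_gpu_ceiling = decode_ceiling_tokens_per_s(2, 2, MI300X_TB_PER_S)
# Idealized 8-GPU case: weights sharded perfectly, no communication cost.
eight_gpu_ceiling = decode_ceiling_tokens_per_s(2, 2, 8 * MI300X_TB_PER_S)

# Share of the idealized ceiling that the reported 3,000 tokens/s represents.
reported_fraction = 3000 / eight_gpu_ceiling

print(f"1 GPU ceiling:  {one_gpu_ceiling:,.0f} tokens/s")
print(f"8 GPU ceiling:  {eight_gpu_ceiling:,.0f} tokens/s")
print(f"3,000 tokens/s is {reported_fraction:.0%} of the 8-GPU ceiling")
```

Under these assumptions the usual 100 to 300 tokens/s sits far below what the memory system could deliver, which is the gap the post says Kog is closing by removing launch, sync, and CPU round-trip stalls.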

Rohan Paul

13,148 views • 2 months ago

This Chinese developer runs 9 agents on Claude Code under a GPT-5.5 orchestrator and they close 500 client tasks a month without a single assistant. His client work is closed without him, on a single laptop and only three subscriptions. The entire system lives on one MacBook Pro M4 with 128 GB of memory, and subscriptions to Claude Code and GPT-5.5 cost him approximately $300 a month. There is no CRM, no team, no office, only a terminal window with 9 parallel streams. The orchestrator works with a simple system prompt: «You are the orchestrator of a client inbox. Classify every incoming email into 4 categories: code, content, analysis, communication. Delegate to the corresponding worker agent. When the result is ready, check it for completeness, send it to the client on my behalf, and mark the task as closed. Do not ask clarifying questions.» The orchestrator checks the inbox every 30 seconds, classifies fresh emails, and distributes them to 9 worker agents on Claude Code, each of which is responsible for its own class of tasks. Here is an example of how one of them closes a request to refactor a client's auth module: «Task: refactor user-auth module. Broke the monolith into 3 files by responsibilities. Added unit tests, coverage increased to 87%. Renamed 4 functions to camelCase according to the style guide. PR is ready for review, link below.» And so on, about 50 cycles a day. By noon 25 tasks are closed, by dinner 50, and by the end of the month 500. On average, it takes about 7 minutes from the appearance of an email in the inbox to sending the result to the client. This is more than what a live team of 6 developers, copywriters and analysts working 8 hours a day closes. This is no longer an agency. This is a workstation where an orchestrator replaces a manager, and 9 worker agents replace the staff. The pipeline goes from inbox to closing 500 times a month without human participation at any step.
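
The classify-then-delegate loop described above can be sketched roughly as follows. This is only an illustration of the routing shape: the post's orchestrator uses an LLM to classify, while this sketch swaps in a hypothetical keyword lookup, and the `Email` type, `KEYWORDS` table, and function names are all made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical keyword lists standing in for the LLM classifier in the post.
# Anything that matches none of them falls through to "communication".
KEYWORDS = {
    "code": ("refactor", "bug", "module", "deploy"),
    "content": ("blog", "copy", "article", "newsletter"),
    "analysis": ("report", "metrics", "forecast", "dashboard"),
}


@dataclass
class Email:
    sender: str
    subject: str
    body: str


def classify(email: Email) -> str:
    """Return one of the 4 categories named in the orchestrator prompt."""
    text = f"{email.subject} {email.body}".lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "communication"


def dispatch(inbox):
    """Group incoming emails into one queue per worker category."""
    queues = defaultdict(list)
    for email in inbox:
        queues[classify(email)].append(email)
    return dict(queues)


inbox = [
    Email("client-a", "Please refactor the user-auth module", "It is one big file."),
    Email("client-b", "Quarterly metrics report", "Need the numbers by Friday."),
    Email("client-c", "Are you free Tuesday?", "Quick call about next steps."),
]
queues = dispatch(inbox)
for category, emails in queues.items():
    print(category, [e.subject for e in emails])
```

In a real setup each queue would feed a worker agent and the loop would run on a timer (the post mentions every 30 seconds); both are left out here to keep the sketch self-contained.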

Blaze

29,917 views • 2 months ago

This Chinese developer linked two $2,999 NVIDIA DGX Sparks into one box and runs the full Qwen3-235B at home, after dropping his $1,999-a-month cloud bill to zero. He wired 2 small boxes into a single computer, split a giant 235-billion-parameter model in half between them, and serves it across his own network at about 10 tokens a second, with no internet, no cloud, right there on the desk. No data center, no thousand-dollar graphics cards, no monthly cloud bill. Just him, 2 gold boxes the size of a sandwich, one cable between them, and 1 power strip. And here is the whole payoff. He used to pay the cloud $1,999 a month for the same model, and the meter ticked on every request. Now he paid $5,998 once for 2 boxes, they covered their cost in 3 months, and after that he sends as many requests as he wants for free, only electricity. The two Sparks talk over one fast cable, each holds 128GB of memory, and together they carry the whole model, about 73GB loaded per box, with the chip inside pinned near the limit at 96%. Both boxes work as one and keep trading data over the cable, with no cloud in the loop and no single word leaking out. The ready model sits on one local address, and any app on his network calls it as easily as ChatGPT. And here is how he described, in plain words, what this pair of boxes does: "this is a pair of boxes that holds the huge Qwen3-235B model and serves it to one network. the model is split in half, and each box owns its half. 
parts:
// Box 1 (holds the first half of the model and starts the answer fast, the first word appears in under a second)
// Box 2 (holds the second half and writes out the rest, about 10 tokens a second)
// Cable (connects the 2 boxes and moves data between them on every step, with no lag)
// Address (one local address where any app sends its request, like to a cloud model)
// Test (a script that runs big prompts through and measures speed and delays)
// Monitor (checks temperature, power draw, and load on both boxes every 2 seconds).
the model never goes to the cloud. he only steps in when a box runs hotter than 80 degrees or the cable between them starts dropping data."
So the system knows exactly what it is, what it is for, and where its limits are. It knows it has to hold the whole huge model across 2 boxes on its own. It knows it has to answer every request locally, with no meter, no limits, and no internet. It knows the human is only needed when a box overheats or the link between them stalls.
→ The setup runs around the clock on 2 boxes, each pulling under 60 watts
→ However many requests he sends, the monthly bill is $0, only electricity
→ The first box starts the answer in under a second
→ The second writes text at about 10 tokens a second
→ One request at a time: 838 tokens in 85 seconds, first word in 0.8s
→ Two requests at once: 697 tokens in 108 seconds, first word in 0.7s
→ Both boxes sit at 96% load and warm up to 76-78 degrees
And only when a chip in a box runs hotter than 80 degrees or the cable between the 2 Sparks drops data does the system call the owner. And when he himself is out on a run or in a coffee shop, he still reaches his own model at home from his phone: sends a big prompt to the local Qwen3-235B, gets the full answer back in under a minute and a half, with no token meter ticking and no limit to hit.
Here is what the test shows on his screen during one of the night runs: "one request at a time: 838 tokens in 84.9 seconds, first word in 0.8s, then 0.1s per token." "two requests at once: 697 tokens in 107.6 seconds, first word in 0.7s, then 0.15s per token." "Box 1: chip at 96% load, 76 degrees, 56 watts, 73GB used in memory." "Box 2: chip at 96% load, 78 degrees, 56 watts, the Qwen3-235B model fully loaded." And while everyone around is paying for AI by the month and bumping into limits, his top-tier model just sits on the desk and works as much as he wants: his own little power plant instead of a forever meter. He has no server rack of his own and no cloud account behind it. Just 2 DGX Spark boxes on a desk, one model split in half between them, one local address, and a folder of prompts next to it. Out of everything I have seen this year, this is the cleanest way to stop paying for AI: $5,998 of hardware on the desk once, $0 a month to the cloud, unlimited forever, and between them 2 gold boxes, 1 cable, and the full Qwen3-235B answering at home with no internet.
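
The numbers in the post can be cross-checked with a few lines of arithmetic. The sketch below uses only figures quoted above; the function names are made up for illustration, electricity is ignored in the payback figure, and the 697 tokens in the two-request run are read as per request (the post does not say). The bytes-per-parameter figure comes from total memory used, so it also includes cache and runtime overhead rather than weights alone.

```python
def payback_months(hardware_cost: float, monthly_cloud_bill: float) -> float:
    """Months until a one-time purchase equals the cloud spend it replaces."""
    return hardware_cost / monthly_cloud_bill


def tokens_per_second(tokens: int, total_seconds: float) -> float:
    return tokens / total_seconds


def predicted_seconds(tokens: int, first_token_s: float, per_token_s: float) -> float:
    """Time to first token plus a fixed delay for every later token."""
    return first_token_s + (tokens - 1) * per_token_s


def implied_bytes_per_param(gb_per_box: float, boxes: int, params_billion: float) -> float:
    """Memory in use divided by parameter count (GB per billion params = bytes per param)."""
    return gb_per_box * boxes / params_billion


payback = payback_months(2 * 2999, 1999)
single_rate = tokens_per_second(838, 84.9)
dual_rate = tokens_per_second(697, 107.6)
single_model = predicted_seconds(838, 0.8, 0.1)
dual_model = predicted_seconds(697, 0.7, 0.15)
bytes_per_param = implied_bytes_per_param(73, 2, 235)

print(f"payback: {payback:.1f} months")
print(f"one request: {single_rate:.1f} tokens/s, model predicts {single_model:.1f}s vs 84.9s logged")
print(f"two requests: {dual_rate:.1f} tokens/s each, model predicts {dual_model:.1f}s vs 107.6s logged")
print(f"memory in use: {bytes_per_param:.2f} bytes per parameter")
```

The results line up with the post: payback is about 3 months, the single-request rate is just under 10 tokens a second, and the logged totals are within a few seconds of what the per-token delays predict. Roughly 0.6 bytes per parameter suggests the weights are stored at well under 8 bits each, though the post never states the quantization used.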

Blaze

93,871 views • 2 months ago

We made a thing! Very happy to announce sqlcoder-pro and the Defog Alignment Platform. Available to use immediately without a wait-list; weights will be open-sourced very soon. The video does a quick show-and-tell comparison against ChatGPT (with gpt-4o). Read on for more details!

TLDR
💪 equal (or better) performance on text-to-SQL compared with the most capable Claude-3.5 or GPT-4 models
🤝 you can use it today on a free plan/free trial, without a waitlist
🪽 self-hostable on a single RTX4090, with 2 second median generation times for SQL queries
🔁 exactly the same output every time, given the same prompt
👨🏻‍🏫 teachable and steerable: show the model what you want it to do
🛞 debuggable – you can understand WTF is going on inside the model, instead of treating it like a black box

Let's dig into each of these one-by-one!

Performance
SQLCoder-8b-pro significantly exceeds the performance of our previous sqlcoder-8b model on Postgres text-to-SQL (from 88.2% to 90.2% accuracy - gpt-4o is at 87.6%, for reference). It is also better at following instructions. This was done via self-merges, hand-crafted fine-tuning data, and adapting the training data to fit our tokenizer.

Cost
You can host the model on a single $3,500 RTX4090 and support ~5 requests/second via vLLM. If you're looking to host on the cloud instead, you can run it on a single L4 GPU that costs $300/mo on GCP.

Repeatability
We have a dense 8b model with no MoE shenanigans. For the same prompt with temperature=0, you'll always get the same answer, which is critical in BI.

Teachable
In our alignment and feedback modes, you can give the model feedback on how it answered certain questions, and it will automatically adapt to the feedback.

Debuggable
You can use logprobs and attention scores to determine where exactly the model is paying attention inside a prompt, and what it's getting confused by when generating outputs.

Available today
You can use Defog on the cloud today by going to docs[dot]defog[dot]ai, and getting an API key. Excited to hear what you think!
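
For readers wondering why temperature=0 gives repeatable output, here is a minimal toy sketch of the sampling step only. It is not Defog's or vLLM's code: `sample_token` and the example logits are made up, and it says nothing about numeric differences that can still arise elsewhere in a real serving stack.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick the next token index from raw logits."""
    if temperature == 0:
        # Greedy decoding: always the highest-logit token, no randomness involved.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Otherwise sample from the temperature-scaled softmax distribution.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 2.5, 3.0, 0.5]

# Same logits, 20 different random seeds: the greedy choice never changes.
greedy_picks = {sample_token(logits, 0, random.Random(seed)) for seed in range(20)}
# With temperature 1.0 the pick depends on the random draw.
sampled_picks = [sample_token(logits, 1.0, random.Random(seed)) for seed in range(20)]

print("greedy picks:", greedy_picks)
print("sampled picks:", sampled_picks)
```

With temperature 0 the random generator is never consulted, so the same prompt walks the same path through the vocabulary every time.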

Rishabh Srivastava

13,460 views • 1 year ago

September 2009. Jensen Huang walks onto a small stage at the Fairmont hotel in San Jose. About 1,500 people are in the room. He runs a company that makes chips for video games. He spends the next 8 minutes doing math on a whiteboard, explaining why the future of computing won't come from making CPUs faster. He calls it "CEO math" and apologizes in advance to every computer science professor in the audience. Then he lays out an argument that almost nobody took seriously at the time: the way to make computers dramatically faster is to pair a regular CPU with hundreds of tiny parallel processors, the kind that already exist inside graphics cards. One CPU for the sequential stuff. Hundreds of GPU cores for everything else. He calls it "heterogeneous computing." He shows the math. A workload that can be split into many pieces at once gets up to 200x faster on this combined system. A workload that has to run one step at a time loses nothing. "The most important thing in creating a new architecture," he says, "is to make sure it does no harm." This was the first GPU Technology Conference. NVIDIA had launched a software platform called CUDA three years earlier, in 2006, to let developers write programs that run on graphics cards instead of just regular processors. Almost nobody cared. GPUs were for rendering Call of Duty, not for scientific computing. The academic world was polite but skeptical. The enterprise world ignored it entirely. By this point, Huang had been making this argument for years. NVIDIA was a $7 billion company. It competed with AMD and Intel for market share in the graphics market. That was the whole business. Jensen kept saying the GPU wasn't just a gaming chip; it was a computing platform. He kept saying parallel processing would reshape every industry from medicine to finance to physics simulations. People kept nodding, then doing nothing. Then deep learning happened. 
Around 2012, AI researchers discovered that training a neural network, which means teaching a computer to recognize patterns by running the same calculation millions of times across huge datasets, was exactly the kind of workload Jensen had been describing. GPUs can train AI models 10 to 50 times faster than CPUs. The architecture he outlined in this 2009 talk, with one CPU handling step-by-step tasks while hundreds of GPU cores crunch through massive amounts of parallel data, is now the literal blueprint for every AI data center on earth. ChatGPT runs on NVIDIA GPUs. Claude runs on NVIDIA GPUs. Gemini, Llama, Midjourney, nearly every major AI model you've heard of was trained on NVIDIA hardware using CUDA, the software platform Jensen built for a market that didn't exist yet. NVIDIA was worth about $7 billion when Jensen gave this talk. It is worth over $4.4 trillion today. That's a 600x increase. Jensen Huang, who founded the company at a Denny's in 1993 with two friends, now has a net worth of over $160 billion. He made Forbes' list of the 10 richest people for the first time this year. GTC 2026 is currently ongoing. 17,000 people are packing a hockey arena to watch the same guy explain what comes next. In 2009, 1,500 people showed up at a hotel ballroom, most of them for gaming graphics.
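
The whiteboard argument (parallel work gets up to 200x faster, sequential work loses nothing) matches what is usually written as Amdahl's law. The post does not say Huang used this exact formula, so treat the sketch below as one standard way to express the idea; the function name and the 95% example are illustrative.

```python
def amdahl_speedup(parallel_fraction: float, accelerator_speedup: float) -> float:
    """Overall speedup when only part of a workload runs on the faster hardware."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)


fully_parallel = amdahl_speedup(1.0, 200)     # everything splits across GPU cores
fully_sequential = amdahl_speedup(0.0, 200)   # nothing splits: the CPU does it all
mostly_parallel = amdahl_speedup(0.95, 200)   # 5% of the work stays sequential

print(f"100% parallel: {fully_parallel:.0f}x")
print(f"  0% parallel: {fully_sequential:.0f}x (no harm done)")
print(f" 95% parallel: {mostly_parallel:.1f}x")
```

The middle case is the "does no harm" point: a purely sequential workload runs at exactly the speed it did before. The last case shows why the headline number needs nearly all of the work to be parallel, since even 5% of sequential work caps the gain at about 18x.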

Anish Moonka

412,600 views • 4 months ago

Every AI agent you've tried has amnesia. It does one task, forgets everything, and tomorrow you start from zero. That's not an employee. That's a temp you have to retrain every single morning. Hyperagent by Airtable is the first platform I've used that actually fixes this. Here's what got me: 1. Agents that compound. Each agent has memory. The one running today is smarter than the one you shipped three weeks ago. Same prompt, same integrations, but weeks of your judgment baked in. 2. Real deliverables, real receipts. You don't get a chat transcript. You get finished work with the cost and runtime printed right on it. A full research report for under ten bucks. Try getting that invoice from an agency. 3. A fleet, not a chatbot. Build a specialist for outreach, another for research, another for reporting. Give each one its own tools, its own memory, and its own budget cap so nothing runs away with your credits. 4. Deploy to Slack and your whole team uses the agent you built. One competitive intel agent, @ mentioned by everyone. Airtable runs its own data team this way. 5. Each agent gets its own cloud machine with a real browser and code execution. It works while you sleep. No babysitting, no local setup, no laptop that has to stay open. I put it to work in the video below. Watch what it builds. The teams treating agents as durable assets instead of one-off prompts are going to lap everyone else. This is the first tool that actually treats them that way. #ad Hyperagent

Leonard Rodman

94,961 views • 13 days ago

Training Volume / Intensity / Rep Range / Progressive Overload: Everything You Need To Know
(This is what will grow MOST people best)

𝗧𝗢𝗧𝗔𝗟 𝗦𝗘𝗧𝗦 𝗣𝗘𝗥 𝗪𝗘𝗘𝗞
45ish-60ish total working sets per week
- If training 3x per week, this will mean 16, 17, 18ish sets per session
- If training 4x per week, this will mean 13, 14, 15ish sets per session
- If training 5x per week, this will mean 10, 11, 12ish sets per session

𝗧𝗢𝗧𝗔𝗟 𝗦𝗘𝗧𝗦 𝗣𝗘𝗥 𝗕𝗢𝗗𝗬 𝗣𝗔𝗥𝗧 𝗣𝗘𝗥 𝗪𝗘𝗘𝗞
- For balanced development, you’re going to want to perform 5, 6, 7, 8ish sets per body part per week
- If prioritizing a muscle group, you’re going to want to perform 8, 9, 10, MAYBE 10+ sets for that body part each week
- If deprioritizing a muscle group, you only need 2, 3, 4ish sets for that body part each week to maintain existing development

𝗙𝗥𝗘𝗤𝗨𝗘𝗡𝗖𝗬
In all likelihood, you will get MORE (in the way of stimulus) by splitting the work you do for a given muscle group across 2 sessions per week
Splitting the work you do for a given muscle group across 3 sessions per week can work as well, but the potential benefit is probably NOT that large and it diminishes the margin of safety
Performing all the work you do for a given muscle group on ONE day (Bro Split Style) can work, but it likely has an opportunity cost associated with it

𝗧𝗢𝗧𝗔𝗟 𝗦𝗘𝗧𝗦 𝗣𝗘𝗥 𝗘𝗫𝗘𝗥𝗖𝗜𝗦𝗘
The sweet spot is generally 2-3 sets for a given exercise in a given session
1 set is fine depending on the context of the programming as a whole, but you likely didn’t squeeze all the juice out of the lemon
If you perform 4+ sets of a given exercise in a given session, what the fuck were you doing the first couple of sets?

𝗜𝗡𝗧𝗘𝗡𝗦𝗜𝗧𝗬
The intensity you take sets to can GREATLY IMPACT how many total sets you can perform while still allowing for adequate recovery from session to session
Generally speaking, it is a good idea to leave about 1 RIR on most exercises to ensure stimulus is robust but fatigue is kept at bay
There is one HUGE caveat to that, however: if you do not trust your ability to accurately gauge RIR, it is better to just take your sets to 0 RIR/Failure than it is to risk sandbagging sets by leaving an incidental 2, 3, 4+ reps in the tank…just know you will not be able to generate as high a net stimulus throughout the week if you live in this intensity range

𝗥𝗲𝗽 𝗥𝗮𝗻𝗴𝗲/𝗣𝗿𝗼𝗴𝗿𝗲𝘀𝘀𝗶𝘃𝗲 𝗢𝘃𝗲𝗿𝗹𝗼𝗮𝗱
Pick a weight you can do for 5-6ish reps with GOOD-GREAT technique @ the prescribed RIR (should be 0-2 RIR)
Once you can hit 7, 8, 9, 10ish reps with the same GOOD-GREAT technique @ the prescribed RIR, increase the load
You can weight select on a SET BY SET BASIS, which means in theory some of your sets could be heavier/lighter than others (assuming you’re doing multiple sets of a given exercise on a given day)
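The weekly-volume split and the progression rule above can be written down as a rough sketch. The function names, the default of 8 reps as the point to add weight (the post only says "7, 8, 9, 10ish"), and the 2.5-unit load step are all assumptions for illustration, not numbers from the post.

```python
def sets_per_session(sessions_per_week, weekly_low=45, weekly_high=60):
    """Divide the 45-60 weekly working sets evenly across sessions."""
    if sessions_per_week <= 0:
        raise ValueError("sessions_per_week must be positive")
    return weekly_low / sessions_per_week, weekly_high / sessions_per_week


def next_load(load, reps_at_prescribed_rir, top_reps=8, step=2.5):
    """Progression rule: keep the load until the rep target is reached.

    reps_at_prescribed_rir: reps completed with good technique while
        stopping at the prescribed RIR (0-2 in the post).
    top_reps: assumed threshold for adding weight.
    step: assumed load increment, in whatever unit `load` uses.
    """
    if reps_at_prescribed_rir >= top_reps:
        return load + step
    return load


if __name__ == "__main__":
    for sessions in (3, 4, 5):
        low, high = sets_per_session(sessions)
        print(f"{sessions}x per week: {low:.1f} to {high:.1f} sets per session")
    print(next_load(100, 6))  # still building reps at this load
    print(next_load(100, 8))  # rep target reached, so load goes up
```

The per-session figures quoted in the post (for example 16-18 sets at 3x per week) fall inside the ranges this division produces.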

Dean Turner

18,116 views • 5 months ago

I am stoked to announce that I won the OpenAI Developers Codex x Mollie Hacka Worldwide Hackathon in Paris. 60+ builders, every one of us working solo, one day to ship.

I built mine around a single question: who gets to own intelligence? The default answer is scary. You hand your data to a handful of labs, they train the model, they own it, and you rent back a thin slice of what your own data made possible. That is the bargain on the table today. I do not accept it.

So I built Lensemble: a Tapestry-like distributed training platform for JEPA-based World Models. What it enables: World Models that a community improves together, keeps sovereign, and co-owns. Two bets sit underneath it.

First, the paradigm. Language models predict the next token. Powerful for text, a dead end for the physical world. A robot does not need to autocomplete sentences, it needs to predict what happens next in the world. That is what JEPA does: it learns by predicting representations instead of pixels or tokens. I am convinced world models are the most underrated paradigm in AI right now, and the closest thing we have to a ChatGPT moment for robotics.

Second, the politics. Your raw trajectories never leave your machine. Each participant trains locally against a shared protocol and ships only an update, never the data. A federated round folds those updates into one shared world model, a LeWorldModel-based model, and the gain is measured, not claimed: a 12k-parameter adapter on a frozen backbone, held-out prediction error down about 12 percent, the model measurably less surprised by the world. Then the upside is split by contribution weight, so the people who improved the model own a share of what it earns.

This is the thesis behind Project Tapestry, the AI Alliance and Yann LeCun's push for federated, sovereign frontier AI, carried into world models and robotics. Call it Tapestry for the physical world. All of it built solo, in a single day, with Codex as my pair the whole way.

Thank you to OpenAI Codex and Mollie for backing builders who ship real things, and to Boris and the organizing crew for the room and the standard you set. Intelligence the world improves, and the world owns. That is the future I want for my kids, and the one I will keep building.
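For readers who want to picture the federated round described above, here is a minimal sketch. The post does not say how Lensemble aggregates updates or computes contribution weight, so this swaps in a plain weighted average of parameter deltas (in the style of federated averaging) and a proportional payout. Every name here (`federated_round`, `split_upside`, the participant ids, the weights) is hypothetical.

```python
def federated_round(global_weights, updates, contributions):
    """Fold per-participant deltas into the shared adapter weights.

    global_weights: list of floats (the shared adapter parameters).
    updates: {participant_id: list of deltas, same length as global_weights}.
    contributions: {participant_id: non-negative weight}; how this weight is
        measured is an assumption, the post does not specify it.
    Only deltas are passed in, never raw trajectories.
    """
    total = sum(contributions[pid] for pid in updates)
    if total <= 0:
        raise ValueError("total contribution weight must be positive")
    merged = list(global_weights)
    for pid, delta in updates.items():
        share = contributions[pid] / total
        for i, d in enumerate(delta):
            merged[i] += share * d
    return merged


def split_upside(amount, contributions):
    """Split an amount in proportion to contribution weight."""
    total = sum(contributions.values())
    if total <= 0:
        raise ValueError("total contribution weight must be positive")
    return {pid: amount * c / total for pid, c in contributions.items()}


if __name__ == "__main__":
    weights = [0.0, 0.0]
    updates = {"alice": [1.0, 0.0], "bob": [0.0, 2.0]}
    contributions = {"alice": 1, "bob": 3}
    print(federated_round(weights, updates, contributions))
    print(split_upside(100.0, contributions))
```

With a 12k-parameter adapter, each update in a scheme like this would be a small vector, which is what makes shipping updates instead of data cheap.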

abdel

17,370 views • 1 month ago

Using Claude Fable 5, I built a model that predicts the entire 2026 FIFA world cup.. every single game, not just the final.. so let me break the whole thing down. what it does, how it works, and exactly how i built it..

#1. First, what it does:
it predicts all 104 games of the tournament. not just who lifts the trophy, but every group match, every knockout, the full path from the round of 32 to the final.. everything lands in one dashboard:
> group stage, every match with each team's win % and the chance of a draw
> standings, how all 12 groups are projected to finish
> bracket, the full knockout tree with each team's odds of advancing
> champion odds, who's most likely to actually win it all
and it doesn't freeze after one prediction. the moment a real game is played, it locks that result in and re-runs everything around it. so the odds move live as the tournament goes, week by week you watch favorites rise and contenders collapse.

#2. How it works:
the core idea is simple. the model only ever predicts one thing, a single match. the real trick is the repetition. it learns from decades of match history, then plays the whole tournament out from the first game to the final, tens of thousands of times. each run it records who advanced and who won. do that enough and you stop getting one guess and start getting real odds, one team lifts the trophy in maybe 14% of the runs, another in 9%, and so on.

#3. So, how i built it?
i didn't hand-write most of the code. i broke the project into 4 pieces, described each one to fable, and let it build while i focused on getting the football logic exactly right.
- The data: every international match going back over a century, around 50,000 games, plus each team's elo rating, which is the truest measure of strength, and the official 2026 schedule. garbage data means garbage predictions, so this part mattered most.
- The features: i turned that raw history into signals the model can learn from, the elo gap between the two teams, recent form, goals scored and conceded, and a home boost for the hosts, usa, canada and mexico.
- The model: for each match it predicts the expected goals for both sides, then turns that into win, draw and loss probabilities plus a likely scoreline. that's what feeds the simulation.
- The tournament engine: this was the hard part. the 2026 world cup is brand new, 48 teams, 12 groups, a round of 32 that's never existed before, and 8 "best third-placed" teams that slot into the bracket by a fixed fifa table. even the group tiebreakers changed this year, head to head now counts before goal difference. get any of it wrong and the whole bracket falls apart, so i built it carefully and tested the format until it was exact, then wrapped it in a simulation loop that plays the tournament out tens of thousands of times.

and the last piece, the live part. as real results come in, they get locked, and only the unplayed games get re-simulated. that's what makes it a living model instead of a one-time prediction.

all of it outputs to a clean dashboard you can actually read and screenshot.. right now, before kickoff, it already has a clear favorite to lift the trophy.. 👀

btw who's your pick to win the 2026 world cup?
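The pipeline described above (expected goals per match, then win/draw/loss probabilities, then many simulated tournaments with played results locked in) can be sketched roughly like this. This is not the author's code: treating goals as independent Poisson draws, the toy `strength` ratings, the 1.3 base goal rate, the coin flip for drawn knockout games, and the four-team bracket are all assumptions chosen to keep the example short.

```python
import math
import random
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of exactly k goals given expected goals lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(xg_a, xg_b, max_goals=10):
    """Win/draw/loss probabilities for team A from two expected-goal values."""
    win = draw = loss = 0.0
    for a in range(max_goals + 1):
        pa = poisson_pmf(a, xg_a)
        for b in range(max_goals + 1):
            p = pa * poisson_pmf(b, xg_b)
            if a > b:
                win += p
            elif a == b:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalise the truncated score grid
    return win / total, draw / total, loss / total


def sample_goals(lam, rng):
    """Draw a Poisson-distributed goal count (Knuth's method)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def play(a, b, strength, rng):
    """Simulate one knockout match; a drawn score is settled by coin flip."""
    ga = sample_goals(1.3 * strength[a] / strength[b], rng)
    gb = sample_goals(1.3 * strength[b] / strength[a], rng)
    if ga == gb:
        return a if rng.random() < 0.5 else b
    return a if ga > gb else b


def champion_odds(teams, strength, runs=20000, seed=0, locked=None):
    """Play a single-elimination bracket many times and count titles.

    locked maps (team_a, team_b) to the real winner, so played games are
    never re-simulated and only unplayed games vary between runs.
    """
    locked = locked or {}
    rng = random.Random(seed)
    titles = Counter()
    for _ in range(runs):
        alive = list(teams)
        while len(alive) > 1:
            nxt = []
            for i in range(0, len(alive), 2):
                pair = (alive[i], alive[i + 1])
                nxt.append(locked.get(pair) or play(*pair, strength, rng))
            alive = nxt
        titles[alive[0]] += 1
    return {t: titles[t] / runs for t in teams}


if __name__ == "__main__":
    teams = ["A", "B", "C", "D"]
    strength = {"A": 1.4, "B": 1.0, "C": 1.1, "D": 0.9}
    print(outcome_probs(1.8, 1.0))
    print(champion_odds(teams, strength))
    print(champion_odds(teams, strength, locked={("A", "B"): "B"}))
```

Passing a `locked` result and re-running is the "living model" step: the odds for every other team shift because one branch of the bracket is no longer random. The real 2026 format (groups, third-placed teams, tiebreakers) would need a much larger engine than this bracket loop.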

Axel Bitblaze 🪓

49,714 views • 1 month ago

This workflow combining Loom’s AI features + a custom ChatGPT GPT is saving me hours. Instead of creating onboarding Docs for new team members, I film a Loom → generate SOP → train a GPT to answer questions. Game changer for businesses to delegate faster. Here's how to do it:

First, I record a video of whatever task I want to delegate to the new team member with Loom. The more in-depth, the better, but I just used a 7-minute video.

Then, I use Loom's new AI 'Write a document' feature. Upgrading to Loom AI from the standard Loom plan cost me $2. Loom AI can generate an entire SOP, PR description, step-by-step guide, QA, and more from a simple Loom video in <5 seconds. In the past, I’ve spent 2 hours+ hand-writing each one of these docs to onboard new team members, so Loom AI is already a massive timesaver. But it gets even better!

Next, we can take that data from the SOP document and use it as 'Knowledge' to train a Custom GPT that can answer the new team members' questions. The more SOP docs/Knowledge you feed the GPT, the better. But one is fine if that's all you have, because the GPT will pull any unknown answers from the web or its training data.

Here are the prompt Instructions you want to put into the Custom GPT (copy and paste this):

You are an expert CEO, specialized in onboarding and training new team members. Using the Knowledge provided, you will help new team members with any questions or stipulations they may have about their new role. Stick as true to the data provided as possible, but if they ask any questions that the Knowledge base does not have a specific answer for, you are permitted to use your pre-trained data and/or web browsing capabilities.

That's it! It can't replace you entirely, but it'll save you 90% of the time you would've wasted on writing an SOP doc and answering questions. Simple AI workflows here and there really add up.

There's also a workflow to help with the job screening process, but I'll save that for another day :^)

Rowan Cheung

129,424 views • 2 years ago

how to set up hermes agent step by step. built-in memory, 40+ tools, works on your phone, and what to think of hermes vs openclaw:

1. hermes is a personal AI agent that runs in your terminal. think of it like openclaw but with built-in memory, 40+ tools out of the box, and 90% cheaper token costs. you install it with one command.
2. the 3 problems with openclaw that hermes solves: no memory (you keep repeating yourself), constant gateway restarts, and zero visibility into what you're spending on tokens.
3. hermes remembers everything. every completed task gets saved to memory. it searches through past logs to find solutions. over time it literally gets smarter at your specific workflows.
4. connect it to open router. you see exact costs per model per task. free models rotate weekly. one founder went from $130 every five days on openclaw to $10 on hermes. same output.
5. it comes preloaded with skills. apple notes, imessage, find my, browser, web search, image generation, cron jobs. no hunting for plugins.
6. connect it to obsidian so it reads your entire vault. connect it to gstack for your dev environment. create custom skills for your specific workflows.
7. the biggest money saver: have it write code once for recurring tasks. then it runs without burning tokens every time. stop paying an LLM to do the same scrape or report daily.
8. run it on android via telegram. name your agents. talk to them like coworkers. in this episode imran shows you how to set this up.
9. you can run it bare metal, in docker, or serverless on modal. pick your risk level.

i begged imran to come on The Startup Ideas Podcast (SIP) 🧃 and walk through the full installation live. he made it impossibly clear.

if you've heard of Hermes Agent and want the clearest explanation of how to get set up like a pro, this is the best personal agent setup video on the internet right now. watch

let me know what you want me to cover on the next ep

GREG ISENBERG

616,663 views • 3 months ago