PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE, but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That's only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!
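The hot/cold split described above can be illustrated with a toy sketch. This is not PowerInfer's code or API: the function names, the activation counts, and the `gpu_budget` parameter are all hypothetical, and the sketch only shows the idea of pinning the most frequently activated neurons to the GPU while leaving the rest to the CPU.

```python
# Toy illustration of a hot/cold neuron split (hypothetical names and data,
# not PowerInfer's implementation).

def split_hot_cold(activation_counts, gpu_budget):
    """Return (hot, cold) neuron index lists.

    hot  = the gpu_budget most frequently activated neurons (kept on the GPU)
    cold = every other neuron (left to the CPU)
    """
    order = sorted(
        range(len(activation_counts)),
        key=lambda i: activation_counts[i],
        reverse=True,
    )
    hot = sorted(order[:gpu_budget])
    cold = sorted(order[gpu_budget:])
    return hot, cold


def gpu_hit_rate(activation_counts, hot):
    """Fraction of all observed activations that land on GPU-resident neurons."""
    total = sum(activation_counts)
    if total == 0:
        return 0.0
    return sum(activation_counts[i] for i in hot) / total


# Made-up profile: a few neurons fire very often, most fire rarely.
counts = [900, 5, 850, 3, 2, 700, 10, 1]
hot, cold = split_hot_cold(counts, gpu_budget=3)
rate = gpu_hit_rate(counts, hot)
print("hot:", hot, "cold:", cold, "GPU hit rate:", round(rate, 3))
```

With a skewed profile like this one, a small GPU-resident set covers most activations, which is the locality property the post refers to.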

elvis

306,501 subscribers

261,583 views • 2 years ago • via X (Twitter)

10 Comments

elvis • 2 years ago

source: github:

Adam Pippert • 2 years ago

This is HUGE. I always suspected that there would be a way to break up the architecture of a model to alleviate the GPU cost and availability bottleneck. Will play with this on LLaMa2, but highly anticipating Mistral-7B.

Hamid R. Darabi • 2 years ago

It's very interesting! 11.69x improvement over llama.cpp sounds a lot, given that it's already super efficient. Are you sure it's not 11.7%? Even that could be a good result.

Loki (cute/acc) • 2 years ago

I almost tried this before I realized it's one more model exchange format 🥲 ".powerinfer.gguf" Can't all of you just use onnx and move on in life? 😭

catid (e/acc) • 2 years ago

Nice I was working in this direction recently noting the same things. 12x speedup is very respectable compared to 8x being the best so far achieved by any one approach (e.g. pruning/quantization)!

Filippo Pedrazzini • 2 years ago

15 GiBs of weights? And this is supposed to run on Consumer Devices?! 🧐

Medium Boss - 70b_Float16.Q8.gguf • 2 years ago

The week I got a new PC with a beefy CPU. Fucking nice.

Sahar Mor • 2 years ago

Paper tl;dr

s3nh • 2 years ago

Fastest bookmark 🤫🤫🤫

Ruairi • 2 years ago

Noob question, is this just for the activation function or is it also cutting down the number of entries for GPU matrix multiplication as well?

Related Videos

🚨 Cerebras Inference is now 3x faster: Llama3.1-70B just broke 2,100 tokens/s - 16x faster than the fastest GPU solution - 8x faster than GPUs running Llama *3B* - It's like the perf of a new hardware generation in a single software release Available now at

Cerebras

236,030 views • 1 year ago

GPU tradeoff series: A100 is not much more powerful than 4090 🫠 GPU Perf and Price: - 4090: 330 fp16 TFLOPs, $1,749 - A100 (80GB): 312 fp16 TFLOPs, $20,000 > A100 is 11.4X more pricy Training speed for GPT-2(124M) with llm.c: - 4090: 153K tokens/s - A100 (80GB): 195K tokens/s > A100 is only 1.3X faster (both trained using a single card, A100 llm.c training is shown in the video, 4090 video is in the quoted tweet) Conclusion: 4090 has a much better cost vs performance ratio Why: As in the H100 vs. 4090 comparison, the biggest difference between A100 and 4090 is their GPU memory size/bandwidth and cross-GPU communication bandwidth, which does not matter too much if your model can fit into a single 4090. Specs: 4090: - GPU memory size: 24GB - memory bandwidth: 1 TB/s - communication bandwidth: 64 GB/s A100: - GPU memory size: 80GB - memory bandwidth: 2 TB/s - communication bandwidth: 900 GB/s Nvidia killed off NVLink (a high-speed communication link that connects GPUs) on 4090. (Jensen Huang smiling face) If multiple 4090s could be interconnected via NVLink, their performance would be closer to datacenter-grade A100 GPUs, even for training larger models. Additionally, 4090 isn't allowed in datacenters, that's how Nvidia makes 💰💰💰
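The cost-versus-performance comparison above can be reproduced with a few lines of arithmetic. The sketch below uses only the prices and llm.c training throughputs quoted in the post; the dictionary layout and helper name are illustrative choices.

```python
# Price/performance arithmetic using the figures quoted in the post.
specs = {
    "4090": {"price_usd": 1_749, "train_tokens_per_s": 153_000},
    "A100 (80GB)": {"price_usd": 20_000, "train_tokens_per_s": 195_000},
}


def tokens_per_s_per_dollar(spec):
    """Training throughput bought per dollar of list price."""
    return spec["train_tokens_per_s"] / spec["price_usd"]


price_ratio = specs["A100 (80GB)"]["price_usd"] / specs["4090"]["price_usd"]
speed_ratio = (
    specs["A100 (80GB)"]["train_tokens_per_s"] / specs["4090"]["train_tokens_per_s"]
)
# How much more throughput per dollar the 4090 delivers on this workload.
value_advantage = tokens_per_s_per_dollar(specs["4090"]) / tokens_per_s_per_dollar(
    specs["A100 (80GB)"]
)

print(f"A100 price ratio: {price_ratio:.1f}x")
print(f"A100 speed ratio: {speed_ratio:.1f}x")
print(f"4090 throughput-per-dollar advantage: {value_advantage:.1f}x")
```

The last figure is simply the price ratio divided by the speed ratio, and it applies only to this single-card GPT-2 (124M) workload.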

Yuchen Jin

234,951 views • 1 year ago

I had to test it myself to believe this unreal inference speed. 3,000 tokens/s for 1 user on standard datacenter GPUs. They leveraged a hidden efficiency gap in how GPUs generate tokens. Kog just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding). Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. That's a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels. Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem. For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing. Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token. Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture. The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips. They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs. 
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request. Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
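The "memory streaming" framing above suggests a simple back-of-the-envelope bound: if every generated token requires reading all active weights once, tokens per second cannot exceed memory bandwidth divided by weight bytes. The sketch below is a rough estimate under that assumption only. It ignores KV-cache traffic, activations, and multi-GPU sharding, and the 1 TB/s bandwidth value is an illustrative placeholder, not a figure for the hardware named in the post.

```python
# Rough upper bound on batch-size-1 decode speed when weight streaming
# is the bottleneck. All inputs are assumptions supplied by the caller.

def max_decode_tokens_per_s(n_params, bytes_per_param, mem_bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling: one full pass over the weights per token."""
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


# 2B parameters in FP16 (2 bytes each) is 4 GB of weights.
# 1e12 bytes/s is a placeholder bandwidth chosen for illustration.
estimate = max_decode_tokens_per_s(
    n_params=2_000_000_000,
    bytes_per_param=2,
    mem_bandwidth_bytes_per_s=1e12,
)
print(f"bandwidth-bound ceiling: {estimate:.0f} tokens/s")
```

Anything that stalls the weight stream (kernel launches, barriers, CPU round trips) pushes real throughput below this ceiling, which is why the post focuses on removing those stalls.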

Rohan Paul

13,080 views • 26 days ago

Stress testing Metropolis 1998! - Population: 53,500 - CPU FPS: 80 - GPU FPS: 10-80 - RAM: 3,100 MB So: At the moment, the game is GPU bound. I know why and there's a lot of room for improvement. It will involve developing or customizing an existing GPU framework. There's also a lot of room to improve CPU/RAM, but right now it's in an okay place. #gamedev

Yesbox - Metropolis 1998

84,831 views • 1 year ago

Traditional GPU rentals often fluctuate due to supply bottlenecks and geopolitical factors. Aethir takes a different approach by providing bare-metal, on-demand GPU access, ensuring cost-efficient pricing and availability for AI training, inference, & agentic workflows.

Aethir

13,266 views • 4 months ago

Luminal fuses entire models into a single GPU kernel, automatically. Let's talk about why this matters for inference at the speed of light:

Joe Fioti

24,326 views • 5 months ago

Gavin Baker says the disaggregation of inference can extend GPU useful lives from 3-4 years to 10-15. That may single-handedly save private credit and reduce the financing rates for GPUs, which will drive demand and help finance the build-out. "The disaggregation of prefill and inference is going to be amazing for the useful lives of GPU and may single-handedly save private credit. Private credit is in pain from these SaaS loans. But there's a lot of private credit in GPUs too. They were underwriting that to 3-4. The disaggregation of inference means that these GPUs are going to have 10 or 15-year lives. The AI skeptics are like, "Oh, these companies are all cooking their books. The useful life of a GPU is only a year or two. The useful life of a CPU is only four years because of the rapid technological change." No. What rapid technological change has done with the disaggregation of prefill and inference is you can put a Cerebras system or Groq LPUs effectively in front of a Hopper or even an Ampere, use that Hopper and Ampere for prefill, and extend the useful life of that GPU until it melts. This is going to be really good for the whole private credit industry. It's gonna help finance the AI build-out. Because if you can start to finance GPUs at 5% or 6% instead of – I think CoreWeave's lowest financing was low sevens – that actually mathematically changes the cost to finance this build-out."

Invest Like the Best

206,845 views • 1 month ago

Releasing moondream-zig! It is a fast implementation of moondream2 inference on the CPU, written from scratch in Zig :) moondream-zig provides 1.5-2x faster inference compared to huggingface on the same device. moondream vik

snow

29,245 views • 1 year ago

Today on Digital Foundry - The Last of Us Part 2 reviewed on PC. It's a better port than Part 1, but impactful legacy issues remain and the game is far more taxing on CPU and GPU than it should be:

Digital Foundry

65,728 views • 1 year ago

Jensen Huang just identified the next $200 billion market (Save this). The shift starts with an observation about agentic AI that changes everything about infrastructure. In the era of training and inference, the GPU was everything while the CPU was a traffic cop, scheduling work, managing memory, and dispatching tasks while the GPU did the heavy lifting. Agentic AI breaks that model entirely. An AI agent does not just run a single inference pass; it plans, calls tools, executes code in sandboxes, retrieves data from multiple sources, and loops through complex multi-step reasoning sequences, often thousands of times per second at scale. Every one of those operations runs through the CPU, and the GPU sits idle waiting for the CPU to prepare the next task, supply the right context, and execute the retrieval and tool-calling logic fast enough to keep the accelerators fed. The CPU is now the conductor and the GPU is the orchestra, and the bottleneck is the conductor falling behind. This is showing up in production AI factory utilization right now, which is exactly why Jensen built Vera from scratch rather than licensing x86. Vera achieves 40% lower peak memory latency than x86, 50% faster core-to-core communication, and 1.8 times the agentic sandbox performance of current x86 processors on a purpose-built architecture designed around the agentic loop. Now here is where the investment thesis gets interesting. The obvious beneficiary is Nvidia itself, and that thesis is real. Nvidia's CFO has guided for nearly $20 billion in Vera CPU revenue this fiscal year alone, a market Nvidia had zero presence in just three years ago. Intel held 60% of server CPU market share as recently as Q4 2025 and that transition is now happening at a pace Intel structurally cannot respond to. But the deeper question is, what architecture is Vera actually built on?
Vera's Olympus cores are ARM compatible and every single Vera CPU deployed in every Vera Rubin rack in every data center in the world runs on ARM architecture. And ARM Holdings collects a royalty on every one of them. ARM does not make chips but rather licenses the instruction set architecture and CPU core designs that others build on top of. Every time Nvidia ships a Vera CPU, every time a hyperscaler deploys a Vera Rubin rack, every time an enterprise qualifies Vera for their AI factory, ARM earns a royalty. The secular tailwind here is almost perfectly constructed for ARM's business model. Amazon's Graviton, Microsoft's Cobalt, Google's Axion, Apple's silicon stack, and Qualcomm's data center push all run on ARM. And now Nvidia's Vera, which is projected to displace Intel as the largest server CPU supplier by revenue in a single fiscal year, is ARM. ARM's royalty rate on high end server chips is estimated at roughly 1 to 2% of chip selling price. At $5,000 per Vera CPU and 4 million units projected for FY2027, that is a royalty line growing from near zero to potentially $400 million to $800 million annually from Nvidia's data center CPU business alone before counting Amazon, Microsoft, Google, Apple, and Qualcomm. The total ARM addressable royalty base across all the silicon it already licenses is compounding at a rate that the current $130 billion market cap does not fully reflect. Jensen's CPU thesis is the most underappreciated catalyst in ARM's fundamental story, and the royalty compounding has barely started. Come join Milk Road Pro and get our full ARM royalty model and our entire AI trade thesis. Link below!
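The royalty estimate above is a product of three stated inputs (unit price, unit volume, royalty rate), so it can be checked directly. The sketch below takes those inputs at face value; the function name and the use of basis points (to keep the arithmetic in exact integers) are illustrative choices. Under these inputs the product comes to $200 million to $400 million, so the quoted $400 million to $800 million range would need a higher price, volume, or rate than the ones stated.

```python
# Check of the royalty arithmetic using the inputs stated in the post.
# Rates are expressed in basis points (1% = 100 bps) to stay in integers.

def royalty_range_usd(unit_price_usd, units, rate_low_bps, rate_high_bps):
    """Return (low, high) annual royalty in USD for a rate range in basis points."""
    chip_revenue = unit_price_usd * units
    low = chip_revenue * rate_low_bps // 10_000
    high = chip_revenue * rate_high_bps // 10_000
    return low, high


low, high = royalty_range_usd(
    unit_price_usd=5_000,   # stated price per CPU
    units=4_000_000,        # stated unit projection
    rate_low_bps=100,       # 1%
    rate_high_bps=200,      # 2%
)
print(f"chip revenue: ${5_000 * 4_000_000:,}")
print(f"royalty range: ${low:,} to ${high:,}")
```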

Milk Road AI

11,785 views • 20 days ago

New course: Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSYS Org and RadixArk, and taught by Richard Chen, a Member of Technical Staff at RadixArk. Running LLMs in production is expensive, and much of that cost comes from redundant computation. This short course teaches you to eliminate that waste using SGLang, an open-source inference framework that caches computation already done and reuses it across future requests. When ten users share the same system prompt, SGLang processes it once, not ten times. The speedups compound quickly, especially when there's a lot of shared context across requests. Skills you'll gain: - Implement a KV cache from scratch to eliminate redundant computation within a single request - Scale caching across users and requests with RadixAttention, so shared context is only processed once - Accelerate image generation with diffusion models using SGLang's caching and multi-GPU parallelism Join and learn to make LLM inference faster and more cost-efficient at scale!
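The "process the shared system prompt once, not ten times" idea can be shown with a toy prefix cache. This is not SGLang's API or its RadixAttention data structure: the `PrefixCache` class and its methods are hypothetical, it stores prefixes in a plain set rather than a radix tree, and it only counts how many tokens would need fresh computation.

```python
# Toy prefix cache (hypothetical, not SGLang): counts tokens that still
# need computing once previously seen prefixes are reused.

class PrefixCache:
    def __init__(self):
        self._seen = set()        # every token prefix processed so far
        self.tokens_computed = 0  # tokens that needed fresh computation

    def process(self, tokens):
        """Process one request; return how many leading tokens were reused."""
        tokens = tuple(tokens)
        reused = 0
        # Find the longest prefix of this request that is already cached.
        for n in range(len(tokens), 0, -1):
            if tokens[:n] in self._seen:
                reused = n
                break
        # Record the new, longer prefixes and count only the uncached tail.
        for n in range(reused + 1, len(tokens) + 1):
            self._seen.add(tokens[:n])
        self.tokens_computed += len(tokens) - reused
        return reused


system_prompt = ["You", "are", "a", "helpful", "assistant"]
cache = PrefixCache()
reuse_per_request = [
    cache.process(system_prompt + [f"q{i}"]) for i in range(10)
]
without_cache = 10 * (len(system_prompt) + 1)
print("reused tokens per request:", reuse_per_request)
print("tokens computed with cache:", cache.tokens_computed, "of", without_cache)
```

The first request pays for the whole prompt; each later request reuses the shared prefix and pays only for its own suffix, so the savings grow with the length of the shared context.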

Andrew Ng

97,357 views • 2 months ago

We are excited to share a deeper look into the GPU Sharing Platform that will power AI applications on the InfraX V4 DApp. In this demo we showcase a GPU running and completing an image generation task. Our backend successfully leverages this GPU (and CPU) in real time to complete the job, earning the user money, and powering the infraX platform. Here’s how it works 👉 when you connect to the InfraX DApp, our platform automatically detects the GPU on your machine and immediately checks its eligibility and where it can be used. From there, each user dictates exactly how and when they want to contribute their GPU power by setting the days their PC is most available (throughout the night, for example). Additionally, each user will be able to pause their GPU's connection to infraX whenever they need to use it locally, resuming with just a single click. Once activated, your GPU seamlessly plugs into the InfraX network and begins fulfilling compute jobs, earning YOU money. Go on, take the $INFRA pill..

infraX | $INFRA

30,615 views • 9 months ago

IN NEWS: Baseten raises a $150M series D round. Tuhin Srivastava (Founder & CEO, Baseten) on the future of inference: “I think the token price goes down and inference should get cheaper over time. And that really just means there is going to be more inference.” “Every time we lower prices or optimize models to make it cheaper, four months later customers are spending more anyway.” “Inference prices will go down, but if the world is run by AI in 10 years, there is going to be a lot of inference. It better be cheap.”

TBPN

17,057 views • 9 months ago

The latest MLX has a CUDA back-end! To get started: pip install "mlx[cuda]" With the same codebase you can develop locally, run your model on Apple silicon, or in the cloud on Nvidia GPUs. MLX is designed around Apple silicon - which has a unified memory architecture. It uses the same design with CUDA. So there's no need to move arrays around from CPU memory to GPU memory. Note, this is early days - some operations are missing and performance is still being optimized. But it's already quite fast for Transformer training, text generation, and more! Here's a demo using mlx-lm to generate text with Llama 3 8B (bf16) on an A100:

Awni Hannun

42,761 views • 11 months ago

Nvidia announces the new RTX Spark, a new platform powered by the NX1 CPU, and shows off Spark laptops running 007 First Light and Forza 6. The CPU has 20 ARM-based cores and a Blackwell RTX GPU with 6144 CUDA cores. This is the same core count as a 5070, but with 128GB of unified LPDDR5X memory sitting in the same package as the CPU and GPU. The entire Nvidia software stack is available, particularly CUDA, vital for AI. Nvidia's new laptops will likely be ideal for running local LLMs because the unified memory means you can load models up to 120-180B parameters (quantized). These laptops are expected to ship later this year and could become strong competitors to high-end MacBooks and even Mac Studios for local AI workloads, thanks to CUDA support and unified memory. Price is unannounced.

Grummz

30,573 views • 23 days ago

$NVDA CEO: “Inference demand will go up by a billion times.” We are just in the very early innings. Compute demand will likely grow by another 10x over the next decade and current TAM projections for GPU/CPU will look too conservative in hindsight. $AMD $NVDA $INTC $ARM

Oguz Erkan

198,071 views • 1 month ago

For the first time, the latest LLMs run on the Apple Neural Engine — and NexaSDK is the only framework that makes it possible, powered by the NexaML engine. Last year, our two co-founders were invited by Apple DMLI team (Data & Machine Learning Innovation) to share their research about on-device multimodal model for local AI agents. One of the big questions in the room was: “Can the newest LLMs actually run on ANE?” At the time, nobody had a clear path. Today, that path exists. NexaSDK now runs Granite-4.0 (IBM), Qwen3 (Qwen), Gemma3 (Google), and Parakeet-v3 (NVIDIA) fully on Apple’s NPU — unlocking low-power, always-on, fast inference across Mac and iPhone. A new wave of NPU-first local AI apps is coming to Apple devices. Start with one line of code on Mac. iOS SDK coming soon.

NEXA AI

30,213 views • 7 months ago