New course: Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSYS Org and RadixArk, and taught by Richard Chen, a Member of Technical Staff at RadixArk. Running LLMs in production is expensive, and much of that cost comes from redundant computation. This short course teaches you to eliminate that waste using SGLang, an open-source inference framework that caches computation already done and reuses it across future requests. When ten users share the same system prompt, SGLang processes it once, not ten times. The speedups compound quickly, especially when there's a lot of shared context across requests. Skills you'll gain: - Implement a KV cache from scratch to eliminate redundant computation within a single request - Scale caching across users and requests with RadixAttention, so shared context is only processed once - Accelerate image generation with diffusion models using SGLang's caching and multi-GPU parallelism Join and learn to make LLM inference faster and more cost-efficient at scale!
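
To make the "processed once, not ten times" idea concrete, here is a minimal Python sketch of prefix reuse. It is not SGLang's RadixAttention implementation: the `PrefixCache` class, the `tokens_computed` counter, and the `t * 2` stand-in for real key/value tensors are all hypothetical, and a real system would share storage in a radix tree rather than store every prefix separately.

```python
from typing import Dict, List, Tuple


class PrefixCache:
    """Toy prefix cache (hypothetical): maps a token prefix to a fake 'KV state'."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[int, ...], List[int]] = {}
        self.tokens_computed = 0  # tokens that actually needed prefill work

    def prefill(self, tokens: List[int]) -> List[int]:
        # Find the longest prefix of this request that is already cached.
        hit = 0
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self._store:
                hit = n
                break
        state = list(self._store[tuple(tokens[:hit])]) if hit else []

        # Only the uncached suffix costs compute.
        for i in range(hit, len(tokens)):
            state.append(tokens[i] * 2)  # stand-in for real K/V tensors
            self.tokens_computed += 1
            self._store[tuple(tokens[: i + 1])] = list(state)
        return state


def run_demo() -> int:
    """Ten users share a 100-token system prompt and add 5 tokens each."""
    cache = PrefixCache()
    system_prompt = list(range(100))
    for user in range(10):
        cache.prefill(system_prompt + [1000 + user] * 5)
    return cache.tokens_computed


if __name__ == "__main__":
    print(run_demo())  # 150, versus 10 * 105 = 1050 without any reuse
```

The first request pays for all 105 tokens; each of the other nine pays only for its 5 unique tokens.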

Andrew Ng

1,684,460 subscribers

99,125 views • 3 months ago • via X (Twitter)

Science & Technology

Similar Videos

New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn. Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you'll learn to reduce a model's memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management. Skills you'll gain: - Quantize a model and measure the accuracy tradeoff - Serve a model with vLLM and watch it handle concurrent requests efficiently - Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy Join and learn to serve LLMs efficiently:
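
The ~140 GB figure can be checked with a few lines of arithmetic. This is a rough sketch, assuming the post means 16-bit weights and 1 GB = 1e9 bytes; the function names are hypothetical, and the layer and head counts in the KV cache example are illustrative values, not taken from the post or from any specific model card.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    print(weight_memory_gb(70e9, 16))  # 140.0 (16-bit weights)
    print(weight_memory_gb(70e9, 8))   # 70.0  (8-bit quantization)
    print(weight_memory_gb(70e9, 4))   # 35.0  (4-bit quantization)

    # Illustrative config (assumption): 80 layers, 8 KV heads, head_dim 128, 16-bit values.
    print(kv_cache_bytes_per_token(80, 8, 128, 2))  # 327680 bytes per token
```

Halving the bits per weight halves the weight footprint, which is why quantization frees GPU memory for the per-request KV cache.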

Andrew Ng

128,334 views • 1 month ago

The Cost of Intelligence is Heading to Zero | Hyperspace P2P Distributed Cache

We present to you our breakthrough cross-domain work across AI, distributed systems, cryptography, and game theory to solve the primary structural inefficiency at the heart of AI infrastructure: most inference is redundant.

Google has reported that only 15% of daily searches are truly novel. The rest are repeats or close variants. LLM inference inherits this same power-law distribution. Enterprise chatbots see 70-80% of queries fall into a handful of intent categories. System prompts are identical across 100% of requests within an application. The KV attention state for "You are a helpful assistant" has been computed billions of times, on millions of GPUs, identically.

And yet every AI lab, every startup, every self-hosted deployment computes and caches these results independently. There is no shared layer. No global memory. Every provider pays the full compute cost for every query, even when the answer already exists somewhere in the network.

This is the problem Hyperspace solves. Its distributed cache operates at three levels, each catching a different class of redundancy:

1. Response cache: Same prompt, same model, same parameters - instant cached response from any node in the network. SHA-256 hash lookup via DHT, with cryptographic cache proofs linking every response to its original inference execution. No trust required. Fetchers re-announce as providers, so popular responses replicate naturally across more nodes.

2. KV prefix cache: Same system prompt tokens - skip the most expensive part of inference entirely. Prefill (computing Key-Value attention states) is deterministic: same model plus same tokens always produces identical KV state. The network caches these states using erasure coding and distributes them via the routing network. New questions that share a common prefix resume generation from cached state instead of recomputing from scratch.

3. Routing to cached nodes: Instead of transferring KV state across the network for every request, Hyperspace routes the request to the node that already has the state loaded in VRAM. The request goes to the cache, not the cache to the request.

Together, these three layers mean that 70-90% of inference requests at network scale never require full GPU computation.

This work doesn't exist in isolation. It builds on research from across the industry: SGLang's RadixAttention demonstrated that automatic prefix sharing can yield up to 5x speedup on structured LLM workloads. Moonshot AI's Mooncake built an entire KV-cache-centric disaggregated architecture for production serving at Kimi. Anthropic, OpenAI, and Google all launched prompt caching products in 2024 - priced at 50-90% discounts - because system prompt reuse is so pervasive that it changes the economics of inference.

What all of these systems share is a common limitation: they operate within a single organization's infrastructure. SGLang caches prefixes within one server. Mooncake disaggregates KV cache within one datacenter. Anthropic's prompt caching works within one API provider's fleet. None of them can share cached state across organizational boundaries.

Hyperspace removes this boundary. The cache is global. A response computed by a node in Tokyo is immediately available to a node in Berlin. A KV prefix state generated for Qwen-32B on one machine is verifiable and reusable by any other machine running the same model. The routing network provides the delivery guarantees, the erasure coding provides the redundancy, and the cache proofs provide the trust.

What this means for the cost of intelligence

Big AI labs scale linearly: twice the users means twice the GPU spend. Every query is a cost center. Their internal caching helps, but it's siloed - Lab A's cache can't serve Lab B's users, and neither can serve a self-hosted Llama deployment.

Hyperspace scales sub-linearly. Every new node that joins the network adds to the global cache. Every inference result enriches the cache for all future requests. The cache hit rate rises with network size because query distributions follow a power law - the most common questions are asked exponentially more often than rare ones.

The implication is simple: as the network grows, the effective cost per inference drops. Not linearly. Logarithmically. At 10 million nodes, we estimate 75-90% of all inference requests can be served from cache, eliminating 400,000+ MWh of energy consumption per year and avoiding over 200,000 tons of CO2 emissions. The first person to ask a question pays the compute cost. Everyone after them gets the answer for free, with cryptographic proof that it's authentic.

Training is competitive. Inference is shared

Open-weight models are converging on quality with closed models. Labs will continue to differentiate on training - data curation, architecture innovation, RLHF tuning. That's where the real intellectual property lives. But inference is a commodity. Two copies of Qwen-32B running the same prompt produce the same KV state and the same response, byte for byte, regardless of whose GPU runs the matrix multiplication. There is no moat in multiplying matrices. The moat is in training the weights.

A global distributed cache makes this separation explicit. It doesn't matter who trained the model. Once the weights are open, the inference cost approaches zero at scale - because the network remembers every answer and can prove it's correct. No lab, no matter how well-funded, can match this. They cannot share caches across competitors. They scale linearly. The network scales logarithmically. The marginal cost of intelligence approaches zero. That's the endgame.
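
The response cache level ("same prompt, same model, same parameters", looked up by SHA-256 hash) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Hyperspace's implementation: the post does not specify the key layout, so the JSON serialization in `cache_key` is a guess, a local dict stands in for the DHT, and cache proofs, erasure coding, and routing are left out entirely. `ResponseCache` and `run_inference` are hypothetical names.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(model: str, prompt: str, params: dict) -> str:
    """SHA-256 over a canonical JSON encoding (assumed key layout)."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "params": params},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class ResponseCache:
    """Exact-match response cache; a dict stands in for the DHT (assumption)."""

    def __init__(self, run_inference: Callable[[str, str, dict], str]) -> None:
        self._run = run_inference
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, model: str, prompt: str, params: dict) -> str:
        key = cache_key(model, prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        # Cache miss: pay for inference once, then remember the result.
        self.misses += 1
        response = self._run(model, prompt, params)
        self._store[key] = response
        return response


if __name__ == "__main__":
    cache = ResponseCache(lambda model, prompt, params: f"echo: {prompt}")
    cache.get("demo-model", "hello", {"temperature": 0})
    cache.get("demo-model", "hello", {"temperature": 0})
    print(cache.hits, cache.misses)  # 1 1
```

Sorting the keys before hashing keeps the key stable when the same parameters arrive in a different order; any change to model, prompt, or parameters produces a different key and therefore a miss.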

Varun

37,542 views • 4 months ago

New course: Build and Train an LLM with JAX, built in partnership with Google and taught by Chris Achard. JAX is the open-source library behind Google's Gemini, Veo, and other advanced models. This short course teaches you to build and train a 20-million parameter language model from scratch using JAX and its ecosystem of tools. You'll implement a complete MiniGPT-style architecture from scratch, train it, and chat with your finished model through a graphical interface. Skills you'll gain: - Learn JAX's core primitives: automatic differentiation, JIT compilation, and vectorized execution - Build a MiniGPT-style LLM using Flax/NNX, implementing embedding and transformer blocks - Load a pretrained MiniGPT model and run inference through a chat interface Come learn this important software layer for building LLMs!

Andrew Ng

192,696 views • 5 months ago

New course announcement: Semantic Caching for AI Agents, taught by Tyler Hutcherson and Iliya Zhechev from Redis. Semantic caching can significantly reduce your AI application's inference costs and latency. If someone asks "How do I get a refund?" and another later asks "I want my money back," semantic caching recognizes these mean the same thing so it can use a cached response instead of making another model call. This short course takes you from building your first semantic cache from scratch to implementing production-ready systems using Redis' open-source tools. Skills you'll gain: - Build semantic caches from scratch, then implement them using Redis' SDK with production features - Measure cache performance using hit rate, precision, recall, and latency - Enhance accuracy with threshold tuning, cross-encoders, LLM validation, and fuzzy matching Join and learn to reduce your agentic AI's costs and improve speed!
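
The refund example can be sketched as a similarity lookup with a threshold. This is a minimal from-scratch illustration, not the Redis SDK: `SemanticCache`, the 0.85 threshold, and the hand-written three-dimensional vectors in `TOY_VECTORS` are all hypothetical stand-ins for a real embedding model, chosen so that the two refund phrasings land close together.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a cached response when a stored query is similar enough."""

    def __init__(
        self, embed: Callable[[str], Sequence[float]], threshold: float = 0.85
    ) -> None:
        self._embed = embed
        self._threshold = threshold
        self._entries: List[Tuple[Sequence[float], str]] = []

    def store(self, query: str, response: str) -> None:
        self._entries.append((self._embed(query), response))

    def lookup(self, query: str) -> Optional[str]:
        vector = self._embed(query)
        best_score, best_response = 0.0, None
        for cached_vector, response in self._entries:
            score = cosine(vector, cached_vector)
            if score > best_score:
                best_score, best_response = score, response
        # Below the threshold counts as a miss, so the caller should call the model.
        return best_response if best_score >= self._threshold else None


# Hand-written toy embeddings (assumption); a real system would use an embedding model.
TOY_VECTORS = {
    "How do I get a refund?": [0.9, 0.1, 0.0],
    "I want my money back": [0.8, 0.2, 0.1],
    "What are your opening hours?": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    cache = SemanticCache(lambda text: TOY_VECTORS[text])
    cache.store("How do I get a refund?", "Refunds are processed from the Orders page.")
    print(cache.lookup("I want my money back"))          # cached refund answer
    print(cache.lookup("What are your opening hours?"))  # None (cache miss)
```

Raising the threshold trades hit rate for precision: fewer queries are served from cache, but those that are served are less likely to be wrong matches.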

Andrew Ng

62,497 views • 8 months ago

Learn how to build an optimized LLM inference system from the ground up in our new short course, Efficiently Serving LLMs, built in collaboration with Predibase by Rubrik and taught by Travis Addair. Whether you're serving your own LLM or using a model hosting service, this course will give you a deep understanding of the optimizations required to efficiently serve many users at once. - Learn how LLMs generate text one token at a time, and how techniques like KV caching, continuous batching, and quantization speed things up and optimize memory usage for serving multiple users. - Benchmark the performance of these LLM optimizations to explore the trade-offs between quickly responding to an individual user’s request vs. serving many users at once. - Use techniques like low-rank adaptation (LoRA) to efficiently serve hundreds of unique, custom fine-tuned models on a single device, without sacrificing throughput. - Use Predibase's LoRAX framework to see optimization techniques in action on a real LLM server. Sign up here:

Andrew Ng

104,727 views • 2 years ago

New course: Agent Memory: Building Memory-Aware Agents, built in partnership with Oracle and taught by Richmond Alake and Nacho Martínez. Many agents work well within a single session but their memory resets once the session ends. Consider a research agent working on dozens of papers across multiple days: without memory, it has no way to store and retrieve what it learned across sessions. This short course teaches you to build a memory system that enables agents to persist memory and thereby learn across sessions. You'll design a Memory Manager that handles different memory types, implement semantic tool retrieval that scales without bloating the context, and build write-back pipelines that let your agent autonomously update and refine what it knows over time. Skills you'll gain: - Build persistent memory stores for different agent memory types - Implement a Memory Manager that orchestrates how your agent reads, writes, and retrieves memory - Treat tools as procedural memory and retrieve only relevant ones at inference time using semantic search Join and learn to build agents that remember and improve over time!

Andrew Ng

160,148 views • 4 months ago

Inkling, Thinking Machines' first open model, dropped today: 975B total / 41B active MoE, up to 1M context, reasoning natively over text, images, and audio. Serving and RL support are already live: you can run and shape it on an open stack, starting now. Day 0 support on SGLang and Miles (RadixArk) 👇 - Inkling's new architecture (ShortConv, attention with relative positional embedding, shared expert sink MoE) is natively implemented and deeply optimized, with prefill full CUDA graph and MXFP8 KV cache - Full-parameter and LoRA RL in a customized Megatron backend, train-inference consistency via customized kernels, routing replay, and cross-runtime parameter synchronization - DFlash speculative decoding from Modal for low-latency serving Launching now, blog and cookbook in the comments ⬇️

LMSYS Org

143,521 views • 17 days ago

Our first short course with Anthropic! Building Towards Computer Use with Anthropic. This teaches you to build an LLM-based agent that uses a computer interface by generating mouse clicks and keystrokes. Computer Use is an important, emerging capability for LLMs that will let AI agents do many more tasks than were possible before, since it lets them interact with interfaces designed for humans to use, rather than only tools that provide explicit API access. I hope you will enjoy learning about it! This course is taught by Anthropic's Head of Curriculum, Colt Steele. You'll learn to apply image reasoning and tool use to "use" a computer as follows: a model processes an image of the screen, analyzes it to understand what's going on, and navigates the computer via mouse clicks and keystrokes. This course goes through the key building blocks, and culminates in a demo of an AI assistant that uses a web browser to search for a research paper, downloads the PDF, and finally summarizes the paper for you. In detail, you'll: - Learn about Anthropic's family of models, when to use which one, and make API requests to Claude - Use multi-modal prompts that combine text and image content blocks, and also work with streaming responses - Improve your prompting by using prompt templates, using XML to structure prompts, and providing examples - Implement prompt caching to reduce cost and latency - Apply tool use to build a chatbot that can call different tools to respond to queries - See all these building blocks come together in a Computer Use demo Please sign up here:

Andrew Ng

170,425 views • 1 year ago

New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason about their behavior, diagnose problems like slow inference, and make smarter decisions about deployment. This course is built in partnership with AMD and taught by Sharon Zhou. You'll see how transformers generate text one token at a time, how the model decides which earlier words matter most when predicting the next one, and how techniques like quantization speed up inference on GPUs. This is not a video-only course; interactive visualizations throughout let you play with these concepts and build intuition that sticks. Skills you'll gain: - Understand why LLMs hallucinate, and how RAG and chain-of-thought shape what they generate - Look inside the model to see how attention and layers combine to predict the next token - Diagnose inference bottlenecks and learn the techniques that speed up transformers on GPUs Join and understand what's really happening inside your LLMs:

Andrew Ng

120,728 views • 2 months ago

PowerInfer - a high-speed inference engine for deploying LLMs locally. Just came across this super interesting project on speeding up inference. It's not MoE but it's a simple approach that exploits the high locality in LLM inference to design a GPU-CPU hybrid inference engine. Hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons (the majority) are computed on the CPU. This approach significantly reduces GPU memory demands and CPU-GPU data transfer. It achieves an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs on a single NVIDIA RTX 4090 GPU. That's only 18% lower than the rate achieved by a top-tier server-grade A100 GPU. It also significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy. There is a lot more innovation around inference that's coming fast. Really encouraged by the study on sparse computation to enhance the computational efficiency of LLMs. It's now possible to use PowerInfer with Llama 2 and Falcon 40B. Mistral-7B support is coming soon!

elvis

261,622 views • 2 years ago

New course: Spec-Driven Development with Coding Agents, built in partnership with JetBrains, and taught by Paul Everitt | @pauleveritt@fosstodon.org. Vibe coding is fast, but often produces code that doesn't match what you asked for. This short course teaches you spec-driven development: write a detailed spec defining what to build, and work with your coding agent to implement it. Many of the best developers already build this way. A spec lets you control large code changes with a few words, preserve context across agent sessions, and stay in control as your project grows in complexity. Skills you'll gain: - Write a detailed specification to define your mission, tech stack, and roadmap, giving your agent the context it needs from the start - Plan, implement, and validate features in iterative loops using a spec as your agent's guide - Apply the same repeatable workflow to both new and legacy codebases - Package your workflow into a portable agent skill that works across agents and IDEs Join and write specs that keep your coding agent on track!

Andrew Ng

462,094 views • 3 months ago

New Course: ACP: Agent Communication Protocol Learn to build agents that communicate and collaborate across different frameworks using ACP in this short course built with IBM Research's BeeAI, and taught by Sandi Besen, AI Research Engineer & Ecosystem Lead at IBM, and Nicholas Renotte, Head of AI Developer Advocacy at IBM. Building a multi-agent system with agents built or used by different teams and organizations can become challenging. You may need to write custom integrations each time a team updates their agent design or changes their choice of agentic orchestration framework. The Agent Communication Protocol (ACP) is an open protocol that addresses this challenge by standardizing how agents communicate, using a unified RESTful interface that works across frameworks. In this protocol, you host an agent inside an ACP server, which handles requests from an ACP client and passes them to the appropriate agent. Using a standardized client-server interface allows multiple teams to reuse agents across projects. It also makes it easier to switch between frameworks, replace an agent with a new version, or update a multi-agent system without refactoring the entire system. In this course, you’ll learn to connect agents through ACP. You’ll understand the lifecycle of an ACP Agent and how it compares to other protocols, such as MCP (Model Context Protocol) and A2A (Agent-to-Agent). You’ll build ACP-compliant agents and implement both sequential and hierarchical workflows of multiple agents collaborating using ACP. Through hands-on exercises, you’ll build: - A RAG agent with CrewAI and wrap it inside an ACP server. - An ACP Client to make calls to the ACP server you created. - A sequential workflow that chains an ACP server, created with Smolagents, to the RAG agent. - A hierarchical workflow using a router agent that transforms user queries into tasks, delegated to agents available through ACP servers. 
- An agent that uses MCP to access tools and ACP to communicate with other agents. You’ll finish up by importing your ACP agents into the BeeAI platform, an open-source registry for discovering and sharing agents. ACP enables collaboration between agents across teams and organizations. By the end of this course, you’ll be able to build ACP agents and workflows that communicate and collaborate regardless of framework. Please sign up here:

Andrew Ng

105,343 views • 1 year ago

I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: - Original: 0.76 tok/s - KV cache fp32: 27.21 tok/s - KV cache int8 (quantized): 27.29 tok/s Try it out yourself here: In practice: - KV caching gave us about a 35x end-to-end speedup - INT8 KV cache kept roughly the same speed as fp32 but cut KV cache memory by 3.78x FP32 cache used 4.5 MB in this run while the INT8 cache used only 1.19 MB This simple change to inference created a huge impact on performance. To learn more about the KV cache and other optimizations like this, check out the blog at
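The post doesn't show its Rust + CUDA code, so here is a minimal pure-Python sketch of the idea behind an INT8 KV cache. It assumes symmetric quantization with one fp32 scale per cached vector, which may differ from the scheme the author used; `quantize_int8`, `dequantize_int8`, `Int8KVCache` and `fp32_nbytes` are hypothetical names chosen for illustration.

```python
def quantize_int8(vec):
    """Symmetric per-vector INT8 quantization (an assumed scheme).

    Returns (list of ints in [-127, 127], fp32 scale).
    """
    max_abs = max((abs(x) for x in vec), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in vec]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate floats from the quantized values."""
    return [v * scale for v in q]


class Int8KVCache:
    """Stores one quantized key and value vector per generated token."""

    def __init__(self):
        self.keys = []    # list of (int8 values, scale)
        self.values = []  # list of (int8 values, scale)

    def append(self, key, value):
        # Each new token's key/value is quantized once, then reused
        # by every later decoding step instead of being recomputed.
        self.keys.append(quantize_int8(key))
        self.values.append(quantize_int8(value))

    def read(self):
        """Dequantize the whole cache for use in attention."""
        keys = [dequantize_int8(q, s) for q, s in self.keys]
        values = [dequantize_int8(q, s) for q, s in self.values]
        return keys, values

    def nbytes(self):
        # 1 byte per int8 element plus 4 bytes for each fp32 scale.
        return sum(len(q) + 4 for q, _ in self.keys + self.values)


def fp32_nbytes(n_tokens, dim):
    """Size of an unquantized cache: keys and values at 4 bytes each."""
    return 2 * n_tokens * dim * 4
```

Under these assumptions a vector of dimension d costs d + 4 bytes instead of 4d, so the saving approaches but never reaches 4x. A ratio slightly under 4x, like the 3.78x reported, is what this kind of overhead would produce, though the post doesn't describe its exact layout.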

Reese Chong

53,026 views • 3 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 3 years ago

Announcing a new Coursera course: Retrieval Augmented Generation (RAG) You'll learn to build high performance, production-ready RAG systems in this hands-on, in-depth course created by and taught by , experienced AI and ML engineer, researcher, and educator. RAG is a critical component today of many LLM-based applications in customer support, internal company Q&A systems, even many of the leading chatbots that use web search to answer your questions. This course teaches you in-depth how to make RAG work well. LLMs can produce generic or outdated responses, especially when asked specialized questions not covered in its training data. RAG is the most widely used technique for addressing this. It brings in data from new data sources, such as internal documents or recent news, to give the LLM the relevant context to private, recent, or specialized information. This lets it generate more grounded and accurate responses. In this course, you’ll learn to design and implement every part of a RAG system, from retrievers to vector databases to generation to evals. You’ll learn about the fundamental principles behind RAG and how to optimize it at both the component and whole-system levels. As AI evolves, RAG is evolving too. New models can handle longer context windows, reason more effectively, and can be parts of complex agentic workflows. One exciting growth area is Agentic RAG, in which an AI agent at runtime (rather than it being hardcoded at development time) autonomously decides what data to retrieve, and when/how to go deeper. Even with this evolution, access to high-quality data at runtime is essential, which is why RAG is a key part of so many applications. 
You'll learn via hands-on experiences to: - Build a RAG system with retrieval and prompt augmentation - Compare retrieval methods like BM25, semantic search, and Reciprocal Rank Fusion - Chunk, index, and retrieve documents using a Weaviate vector database and a news dataset - Develop a chatbot, using open-source LLMs hosted by Together AI, for a fictional store that answers product and FAQ questions - Use evals to drive improving reliability, and incorporate multi-modal data RAG is an important foundational technique. Become good at it through this course! Please sign up here:
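Of the retrieval methods named above, Reciprocal Rank Fusion is the easiest to show in a few lines. The sketch below is not taken from the course: it uses the commonly cited formula, score(d) = sum over rankers of 1 / (k + rank), with k = 60 as a conventional default. The function name and the toy document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is a common default, not a requirement.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy example: one keyword ranking (e.g. BM25) and one semantic ranking.
bm25_ranking = ["a", "b", "c"]
semantic_ranking = ["b", "c", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)  # ['b', 'a', 'c']
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale, which is the main appeal of this fusion method.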

Andrew Ng

124,656 views • 1 year ago

New short course: LLMs as Operating Systems: Agent Memory, created with Letta, and taught by its founders Charles Packer and Sarah Wooders. An LLM's input context window has limited space. Using a longer input context also costs more and results in slower processing. So, managing what's stored in this context window is important. In the innovative paper MemGPT: Towards LLMs as Operating Systems, its authors (which include the instructors) proposed using an LLM agent to manage this context window. Their system uses a large persistent memory that stores everything that could be included in the input context, and an agent decides what is actually included. Take the example of building a chatbot that needs to remember what's been said earlier in a conversation (perhaps over many days of interaction with a user). As the conversation's length grows, the memory management agent will move information from the input context to a persistent searchable database; summarize information to keep relevant facts in the input context; and restore relevant conversation elements from further back in time. This allows a chatbot to keep what's currently most relevant in its input context memory to generate the next response. When I read the original MemGPT paper, I thought it was an innovative technique for handling memory for LLMs. The open-source Letta framework, which we'll use in this course, makes MemGPT easy to implement. It adds memory to your LLM agents and gives them transparent long-term memory. 
In detail, you’ll learn: - How to build an agent that can edit its own limited input context memory, using tools and multi-step reasoning - What is a memory hierarchy (an idea from computer operating systems, which use a cache to speed up memory access), and how these ideas apply to managing the LLM input context (where the input context window is a "cache" storing the most relevant information; and an agent decides what to move in and out of this to/from a larger persistent storage system) - How to implement multi-agent collaboration by letting different agents share blocks of memory This course will give you a sophisticated understanding of memory management for LLMs, which is important for chatbots having long conversations, and for complex agentic workflows. Please sign up here!

Andrew Ng

200,788 views • 1 year ago

Important new course: Agent Skills with Anthropic, built with Anthropic and taught by Elie Schoppik! Skills are constructed as folders of instructions that equip agents with on-demand knowledge and workflows. This short course teaches you how to create them following best practices. Because skills follow an open standard format, you can build them once and deploy across any skills-compatible agent, like Claude Code. What you'll learn: - Create custom skills for code generation and review, data analysis, and research - Build complex workflows using Anthropic's pre-built skills (Excel, PowerPoint, skill creation) and custom skills - Combine skills with MCP and subagents to create agentic systems with specialized knowledge - Deploy the same skills across Claude Code, the Claude API, and the Claude Agent SDK Join and learn to equip agents with the specialized knowledge they need for reliable, repeatable workflows.

Andrew Ng

888,384 views • 6 months ago

Jensen Huang explains the difference between AI and traditional Software and why there's no bubble. Traditional software was pre-compiled, meaning it was built once, stored, and then executed repeatedly with little computation. It didn’t need constant high-power processing once finished. Users simply ran the completed program as a tool. AI, by contrast, generates its output in real time. It must process context, reason, and produce intelligence at the moment of use, not in advance. This requires ongoing computation for every request. Because of that, AI systems depend on continuous GPU power to “manufacture” responses like a factory producing tokens. So instead of static software tools, AI is an active computational process that needs large-scale, always-on infrastructure to create intelligence dynamically. --- From 'FT Live' YT channel (link in comment)

Rohan Paul

761,107 views • 8 months ago

Learn to train an LLM with distributed data while ensuring privacy using federated learning in a new two-part short course, Intro to Federated Learning and Federated Fine-tuning of LLMs with Private Data, created with Flower and taught by Daniel J. Beutel and nic lane. Federated learning allows a single model to be trained across multiple devices, such as phones, or multiple organizations, such as hospitals, without the need to share data to a central server. This two-part course gives you an introduction to federated learning, and then teaches you how to fine-tune your large language model with distributed data using Flower Lab’s open source federated learning framework. You’ll learn: - How to use federated learning to train a variety of models, ranging from speech and vision models to LLMs, across distributed data while offering data privacy options to users and organizations. - Privacy Enhancing Technologies like differential privacy (DP), which obscures individual data by adding calibrated noise to query results. - Two variants of differential privacy - Central and Local - and how to choose depending on your use case. - How to measure and decrease bandwidth usage to make federated learning more practical and efficient with techniques like using pre-trained models and Parameter-Efficient Fine-Tuning - How federated LLM fine-tuning reduces the risk of leaking training data. Sign up here!
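To make "adding calibrated noise to query results" concrete, here is a minimal sketch of the Laplace mechanism on a simple counting query. It is not Flower's API and not the course's code; federated setups usually apply noise to clipped model updates rather than counts. `laplace_noise` and `private_count` are hypothetical helpers, and the sensitivity of 1 assumes that adding or removing one record changes the count by at most one.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that match the predicate.

    epsilon: privacy budget; smaller values mean more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # assumed: one record changes the count by <= 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Toy usage: how many (made-up) ages are 40 or above?
ages = [23, 35, 41, 52, 67, 29, 48]
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0)
print(round(noisy, 2))  # varies from run to run
```

The noise is "calibrated" in the sense that its scale is sensitivity / epsilon: a stricter privacy budget or a more sensitive query both widen the noise.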

Andrew Ng

64,558 views • 2 years ago