New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with Red Hat and taught by Cedric Clyburn. Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just...
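As a rough check on the ~140 GB figure, the weight footprint can be estimated from the parameter count. This is a minimal sketch that assumes 16-bit weights (2 bytes per parameter); the post is truncated before it states the precision, so that number is an assumption. The function name `weight_memory_gb` is illustrative only.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float = 2.0) -> float:
    """Approximate memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# Assumption: 16-bit weights use 2 bytes per parameter, 8-bit weights use 1.
fp16_gb = weight_memory_gb(70e9)
int8_gb = weight_memory_gb(70e9, 1.0)

print(f"16-bit: {fp16_gb:.0f} GB, 8-bit: {int8_gb:.0f} GB")
```

This counts weights only. Memory needed while serving requests comes on top of it.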

112,086 views • 18 days ago •via X (Twitter)

Related Videos

New short course: LLMs as Operating Systems: Agent Memory, created with Letta, and taught by its founders Charles Packer and Sarah Wooders.

An LLM's input context window has limited space. Using a longer input context also costs more and results in slower processing. So, managing what's stored in this context window is important. In the innovative paper MemGPT: Towards LLMs as Operating Systems, its authors (which include the instructors) proposed using an LLM agent to manage this context window. Their system uses a large persistent memory that stores everything that could be included in the input context, and an agent decides what is actually included.

Take the example of building a chatbot that needs to remember what's been said earlier in a conversation (perhaps over many days of interaction with a user). As the conversation's length grows, the memory management agent will move information from the input context to a persistent searchable database; summarize information to keep relevant facts in the input context; and restore relevant conversation elements from further back in time. This allows a chatbot to keep what's currently most relevant in its input context memory to generate the next response.

When I read the original MemGPT paper, I thought it was an innovative technique for handling memory for LLMs. The open-source Letta framework, which we'll use in this course, makes MemGPT easy to implement. It adds memory to your LLM agents and gives them transparent long-term memory.

In detail, you’ll learn:
- How to build an agent that can edit its own limited input context memory, using tools and multi-step reasoning
- What is a memory hierarchy (an idea from computer operating systems, which use a cache to speed up memory access), and how these ideas apply to managing the LLM input context (where the input context window is a "cache" storing the most relevant information; and an agent decides what to move in and out of this to/from a larger persistent storage system)
- How to implement multi-agent collaboration by letting different agents share blocks of memory

This course will give you a sophisticated understanding of memory management for LLMs, which is important for chatbots having long conversations, and for complex agentic workflows. Please sign up here!
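The "context window as cache" idea can be sketched in a few lines. This is a toy illustration, not the Letta API or the MemGPT implementation: the class `ContextManager`, its fields, and the message budget are all hypothetical, and a fixed oldest-first eviction rule stands in for the LLM agent that would actually decide what to move in and out.

```python
from dataclasses import dataclass, field


@dataclass
class ContextManager:
    """Toy memory hierarchy: a small in-context list backed by a larger archive."""

    max_messages: int = 3  # hypothetical context budget, counted in messages
    context: list[str] = field(default_factory=list)
    archive: list[str] = field(default_factory=list)

    def add(self, message: str) -> None:
        """Append a message, moving the oldest ones to the archive when full."""
        self.context.append(message)
        while len(self.context) > self.max_messages:
            self.archive.append(self.context.pop(0))

    def recall(self, keyword: str) -> list[str]:
        """Search the archive with a naive case-insensitive substring match."""
        return [m for m in self.archive if keyword.lower() in m.lower()]


memory = ContextManager(max_messages=2)
for msg in ["My name is Ada.", "I like hiking.", "What's the weather?"]:
    memory.add(msg)

print(memory.context)
print(memory.recall("name"))
```

Here the oldest message leaves the context once the budget is exceeded but stays searchable in the archive. This mirrors the move-out and restore steps described in the post, without the summarization step.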

Andrew Ng

200,729 views • 1 year ago

New short course: Long-Term Agentic Memory with LangGraph. Learn to build an agent with long-term memory in this course developed in collaboration with LangChain and taught by its Co-Founder and CEO, Harrison Chase!

Personal assistance and productivity tasks have become important use cases for agents. An important feature of an AI assistant, such as a coding or calendar assistant, is its ability to keep improving over time from its experience. Agent memory is the key capability that enables this.

To add memory to an agent, you must first figure out what to store and what to retrieve when it is time to use the information. Additionally, you’ll have to decide when to update the stored information. For example, you might update in each iteration loop of the agent or perform updates in the background, with a helper agent.

In this course, you will learn a mental framework to build agents with long-term memory. You'll create a useful email assistant that can respond, ignore, and notify using writing, scheduling, and memory-management tools. You’ll develop your agent's memory by adding facts to its memory store, providing examples so it learns the user's preferences, and optimizing system prompts to evolve instructions based on previous responses.

In detail, you’ll:
- Learn how the three types of memory (semantic, episodic, and procedural) and the two update mechanisms (in the hot path and in the background) apply to your agents.
- Build an email agent with writing, scheduling, and availability tools, along with a router that triages incoming email and handles it accordingly by ignoring, responding, or notifying the user.
- Add tools to your email agent that allow it to operate on semantic memory by learning facts about the user, storing them in a long-term memory store, and searching over them in future interactions.
- Incorporate episodic memory, in the form of few-shot examples, in the triage step of your agents to help them learn and update user preferences.
- Add procedural memory as system prompts, optimized with feedback to improve the instructions the agent follows.

Learn how to approach memory in agents, and start building agents with long-term memory with LangGraph! Please sign up here:

Andrew Ng

131,640 views • 1 year ago

OpenAI just announced API access to o1 (advanced reasoning model) yesterday. I'm delighted to announce today a new short course, Reasoning with o1, built with OpenAI, and taught by Colin Jarvis, Head of AI Solutions at OpenAI, to show you how to use this effectively!

Unlike previous language models, which generate output directly, o1 “thinks before it responds,” and generates many reasoning tokens before returning a more thoughtful and accurate response. It is great at complex reasoning -- including planning for agentic workflows, coding, and domain-specific reasoning in fields such as law and STEM. But how you should use it is quite different from other LLMs. I think o1 will be a game changer for many AI applications; and in this course, you'll learn how to use it effectively.

In detail, you’ll:
- Learn to recognize what tasks o1 is suited for, and when to use a smaller model, or combine o1 with a smaller model
- Understand the new principles of prompting reasoning models: Be simple and direct; no explicit chain-of-thought required; use structure; show rather than tell
- Implement multi-step orchestration in which o1 plans, and hands tasks over to gpt-4o-mini to execute specific steps; this illustrates a design pattern to optimize intelligence (accuracy) and cost
- Use o1 for a coding task to build a new application, edit existing code, and test performance by running a coding competition between o1-mini and GPT-4o
- Use o1 for image understanding and learn how it performs better with a "hierarchy of reasoning," in which it incurs the latency and cost upfront, preprocessing the image and indexing it with rich details so it can be used for Q&A later
- Learn a technique called meta-prompting, in which you use o1 to improve your prompts. Using a customer support evaluation set, you'll iteratively use o1 to modify a prompt to improve performance

You'll also learn about how OpenAI used reinforcement learning to produce a model that uses "test-time compute" to improve performance. I think you'll find this course enjoyable and valuable. Please sign up for it here:
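The plan-then-execute orchestration pattern mentioned above can be sketched without any model calls. This is a minimal illustration under stated assumptions: `orchestrate`, `fake_planner`, and `fake_executor` are hypothetical names, and the two stand-in functions take the place of requests to a reasoning model and a smaller, cheaper model.

```python
from typing import Callable


def orchestrate(
    task: str,
    planner: Callable[[str], list[str]],
    executor: Callable[[str], str],
) -> list[str]:
    """Ask the planner for steps once, then run each step with the executor."""
    steps = planner(task)
    return [executor(step) for step in steps]


# Stand-ins for model calls (assumption: real code would wrap API requests here).
def fake_planner(task: str) -> list[str]:
    return [f"{task}: step {i}" for i in (1, 2)]


def fake_executor(step: str) -> str:
    return f"done({step})"


results = orchestrate("refund order", fake_planner, fake_executor)
print(results)
```

Keeping the planner and executor as plain callables makes it easy to swap in a more capable model for planning and a cheaper one for execution, which is the cost and accuracy trade-off the post describes.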

Andrew Ng

357,401 views • 1 year ago

#mixtral #mistral #LLM360 Serving Mixtral and LLM360 on FEDML Nexus AI. We offer the cheapest Mixtral model endpoints in the market: only $0.0005 / 1K tokens!

FEDML embraces open source and open model weights. We believe the future of AI belongs to large-scale open collaboration. Today we are excited to support new advances in open-source foundation models: Mixtral, the latest open-source LLM beating Llama2-70B with Mixture-of-Experts (MoE) architecture, and Amber and CrystalCoder backed by LLM360, the framework for open-source LLMs to foster transparency, trust, and collaborative research.

Compared to existing fragmented ML products in the market, FEDML Nexus AI is the next-gen cloud service for LLM and Generative AI. It provides an end-to-end platform backed by serverless/decentralized AI infrastructure. Specifically:
1. The economical serving engine, ScaleLLM, runs your model at a lower price by optimizing GPU memory and throughput to support more concurrent requests.
2. FEDML® Deploy simplifies the CLI and MLOps workflow for model deployment on a serverless GPU cloud or on-premise cluster.
3. Serverless Endpoint runs on serverless GPU clouds. With our pay-per-use policy, we remove the burden of acquiring or leasing an extensive GPU inventory when you are uncertain about your future AI service traffic. The autoscaling feature seamlessly adjusts the backend GPU resources in response to your service traffic.
4. On-premise Deployment helps you own your LLM in your local environment with AI safety support.
5. FEDML® Launch for serverless GPU clouds. With a one-line CLI command, it swiftly pairs AI jobs with the most economical GPU resources, auto-provisions, and effortlessly runs the job, abstracting complex environment setup and management.
6. Zero-code Fine-tuning supported by FEDML® Studio optimizes your model on your domain-specific data without writing a single line of source code.
7. Pre-training LLM supports cluster management and experiment tracking. You maintain your training clusters for your urgent needs in your vertical domain.

As a closing note, FEDML is gearing up to unveil a cutting-edge service for LLM-based agents and our own cost-effective LLM. Please stay tuned and keep an eye out for upcoming announcements!

TensorOpera AI

90,271 views • 2 years ago