JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models paper page: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

AK

511,803 subscribers

141,440 views • 2 years ago •via X (Twitter)

Related Videos

VITA Towards Open-Source Interactive Omni Multimodal LLM discuss: The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, and meanwhile has an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still lots of work to be done on VITA to get close to close-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

LARP: Language-Agent Role Play for Open-World Games paper page: Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we introduce Language Agent for Role-Playing (LARP), which includes a cognitive architecture that encompasses memory processing and a decision-making assistant, an environment interaction module with a feedback-driven learnable action space, and a postprocessing method that promotes the alignment of various personalities. The LARP framework refines interactions between users and agents, predefined with unique backgrounds and personalities, ultimately enhancing the gaming experience in open-world contexts. Furthermore, it highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios.

AK

143,974 views • 2 years ago

Tencent presents GameGen-O Open-world Video Game Generation We introduce GameGen-O, the first diffusion transformer model tailored for the generation of open-world video games. This model facilitates high-quality, open-domain generation by simulating a wide array of game engine features, such as innovative characters, dynamic environments, complex actions, and diverse events. Additionally, it provides interactive controllability, thus allowing for the gameplay simulation. The development of GameGen-O involves a comprehensive data collection and processing effort from scratch. We collect and build the first Open-World Video Game Dataset (OGameData), amassed extensive data from over a hundred of next-generation open-world games, employing a proprietary data pipeline for efficient sorting, scoring, filtering, and decoupled captioning. This robust and extensive OGameData forms the foundation of our model's training process. GameGen-O undergoes a two-stage training process, consisting of foundation model pretraining and instruction tuning. In the first phase, the model is pre-trained on the OGameData via the text-to-video and video continuation, endowing GameGen-O with the capability for open-domain video game generation. In the second phase, the pre-trained model is frozen, and we fine-tuned using a trainable InstructNet, which enables the production of subsequent frames based on multimodal structural instructions. This whole training process imparts the model with the ability to generate and interactively control content. In summary, GameGen-O represents a notable initial step forward in the realm of open-world video game generation via generative models. It underscores the potential of generative models to serve as an alternative to rendering techniques, which can efficiently combine creative generation with interactive capabilities.

AK

367,088 views • 1 year ago

🚨🚨Can agents earn money, run a business, or even self-organize a society in the physical social world? 🤖🤖 Can agents learn continually to survive and thrive in embodied environments, like how human babies grow? 👶 Super excited to introduce SimWorld, an open-ended simulator of LLM agents in infinite, realistic embodied worlds. SimWorld features 3 key designs:
1⃣Open-ended realistic world simulation
- built on Unreal Engine 5, with accurate physical social dynamics
- 100+ built-in environments (city, island, wilderness ...)
- language-controllable procedural generation
- text-to-3D asset generation
2⃣Native interface for LLM/VLM agents
- Gym-like agent-environment interaction APIs
- plug in any LLMs/VLMs (GPTs, Gemini, Qwen ...)
- rich multi-modal perception
- open-vocabulary natural-language action outputs
3⃣Diverse physical and social reasoning scenarios
- long-horizon embodied reasoning
- multi-agent collaboration / competition
- easily customizable for any reasoning tasks
SimWorld is fully open-sourced, with a hope to become a foundational infrastructure for real-world agent research across disciplines: robotics, economy, public health, education, etc. Project website + more details in the thread👇 ...1/

Lianhui Qin

65,079 views • 7 months ago

Can AI agents adapt zero-shot, to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first method that, without expert labeled datasets, solves compositional tasks requiring hundreds of steps for completion. All the modules within MaestroMotif are learned from interaction: from the highest level of planning to the lowest-level of sensorimotor control. On the open-ended domain of NetHack, it surpasses existing approaches, including those that are fine-tuned specifically for each task. At the heart of MaestroMotif is the idea that decomposing a task into subtasks significantly helps decision making. MaestroMotif leverages an agent designer's intuition about a domain to identify important skills and describe them in natural language. These short descriptions then get converted into adaptable hierarchical agents through AI feedback and in-context learning. Our paper was recently published at ICLR 2025 and we open-source the whole project including the code, prompts and pre-trained models. Paper: Code: NotebookLM Podcast: This work was done with the amazing Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, with equal supervision by Marlos C. Machado and Pierluca D'Oro. Take a look at the following thread:

Martin Klissarov

80,280 views • 1 year ago

Open sourcing Dynamic Graph Memory by mem0. Memory is fundamental to human reasoning, shaping how we approach tasks and make decisions. At Mem0, we believe that AI agents & apps should reflect this principle. Our Dynamic Graph Memory emulates human memory, advancing AI agents toward more intelligent, human-like reasoning. This is a significant step forward in building AI that truly understands and interacts with the world like we do. All credit to Dev Khant Deshraj Yadav Prateek Chhikara for their countless nights spent on bringing this to life. Link:

Taranjeet

51,129 views • 1 year ago

3D-LLM: Injecting the 3D World into Large Language Models paper page: Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs.

AK

249,708 views • 3 years ago

General Intuition CEO Pim de Witte, who's building foundation models trained on video game controller input data ("action-labeled gameplay clips"), says general intelligence won't "taste like an LLM": "We have a scale of data that's going to allow us to jump to the frontier in one capability — which is any system that can be controlled with a game controller (which is most robots) — and then, you can use that to create a sufficiently general intelligence." "As humans, the decision to talk or type is a very, very small subset of the actions that we can actually take." "So in order to create a sufficiently general intelligence to play 10,000+ video games, the model has to be able to predict across the entire action space of human cognition when they're interacting with these environments. Which are 2D and 3D environments, interfaces, long-horizon tasks, short-horizon tasks, [etc.]." "It has to be a sufficiently general intelligence in order to predict actions. Therefore, the type of model you get out is not going to taste like an LLM. This model is going to be incredibly good at navigating unforeseen environments. It's going to be incredibly good at zero-shotting any task that can be done with a game controller."

TBPN

80,143 views • 27 days ago

How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet? Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks Paper: Leaderboard: We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals. Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck. Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation. See a more detailed thread by Jing Yu Koh.

Russ Salakhutdinov

22,518 views • 2 months ago
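
The Odysseys post above describes its Trajectory Efficiency metric only as "rubric score per step". A minimal sketch of that ratio follows, assuming a rubric is a list of pass/fail criteria and the rubric score is the fraction satisfied. The `Trajectory` class, its field names, and this scoring rule are hypothetical illustrations; the paper's actual scoring and aggregation may differ.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical representation: one pass/fail result per rubric criterion.
    rubric_results: list[bool]
    # Number of agent actions taken in the episode.
    num_steps: int


def rubric_score(t: Trajectory) -> float:
    """Fraction of rubric criteria satisfied (assumed scoring rule)."""
    if not t.rubric_results:
        return 0.0
    return sum(t.rubric_results) / len(t.rubric_results)


def trajectory_efficiency(t: Trajectory) -> float:
    """Rubric score divided by the number of steps taken."""
    if t.num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return rubric_score(t) / t.num_steps


run = Trajectory(rubric_results=[True, True, False, True], num_steps=50)
print(f"{rubric_score(run):.2f}")           # 0.75
print(f"{trajectory_efficiency(run):.4f}")  # 0.0150
```

In this toy run, an agent that satisfies three of four criteria in 50 steps scores 0.75 on the rubric but only 1.5% per step, which shows how a long trajectory can pull efficiency down even when the rubric score is high.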

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago
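
The MotionGPT post above says 3D motion is converted into discrete "motion tokens" by vector quantization. The sketch below shows only the nearest-codebook lookup at the heart of that idea. The tiny hand-written codebook, the 2D feature vectors, and the function names are all hypothetical; in a real system the codebook would be learned and the features would come from a motion encoder.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry."""
    tokens = []
    for f in features:
        nearest = min(
            range(len(codebook)),
            key=lambda i: squared_distance(f, codebook[i]),
        )
        tokens.append(nearest)
    return tokens


def dequantize(tokens, codebook):
    """Replace each token with its codebook vector (a lossy reconstruction)."""
    return [codebook[t] for t in tokens]


# Toy codebook of four 2D entries (illustrative values only).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
# Toy per-frame motion features.
motion_features = [(0.1, -0.1), (0.9, 0.2), (0.8, 0.9), (0.1, 0.7)]

tokens = quantize(motion_features, codebook)
print(tokens)  # [0, 1, 3, 2]
```

Once motion is a sequence of integer indices like this, it can be placed in the same vocabulary as word tokens, which is what lets a language model treat motion "as a specific language".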

New short course: Long-Term Agentic Memory with LangGraph. Learn to build an agent with long-term memory in this course developed in collaboration with LangChain and taught by its Co-Founder and CEO, Harrison Chase! Personal assistance and productivity tasks have become important use cases for agents. An important feature of an AI assistant, such as a coding or calendar assistant, is its ability to keep improving over time from its experience. Agent memory is the key capability that enables this. To add memory to an agent, you must first figure out what to store and what to retrieve when it is time to use the information. Additionally, you’ll have to decide when to update the stored information. For example, you might update in each iteration loop of the agent or perform updates in the background, with a helper agent. In this course, you will learn a mental framework to build agents with long-term memory. You'll create a useful email assistant that can respond, ignore, and notify using writing, scheduling, and memory-management tools. You’ll develop your agent's memory by adding facts to its memory store, providing examples to learn the user's preferences, and optimizing system prompts to evolve instructions based on previous responses. In detail, you’ll:
- Learn how the three types of memory (semantic, episodic, and procedural) and the two update mechanisms (via hot path and in the background) apply to your agents.
- Build an email agent with writing, scheduling, and availability tools, along with a router that triages incoming email and handles it accordingly by ignoring, responding, or notifying the user.
- Add tools to your email agent that allow it to operate on semantic memory by learning facts about the user, storing them in a long-term memory store, and searching over them in future interactions.
- Incorporate episodic memory, in the form of few-shot examples, in the triage step of your agents to help them learn and update user preferences.
- Add procedural memory as system prompts, optimized with feedback to improve the instructions the agent follows.
Learn how to approach memory in agents, and start building agents with long-term memory with LangGraph! Please sign up here:

Andrew Ng

131,779 views • 1 year ago

New short course: LLMs as Operating Systems: Agent Memory, created with Letta, and taught by its founders Charles Packer and Sarah Wooders. An LLM's input context window has limited space. Using a longer input context also costs more and results in slower processing. So, managing what's stored in this context window is important. In the innovative paper MemGPT: Towards LLMs as Operating Systems, its authors (who include the instructors) proposed using an LLM agent to manage this context window. Their system uses a large persistent memory that stores everything that could be included in the input context, and an agent decides what is actually included. Take the example of building a chatbot that needs to remember what's been said earlier in a conversation (perhaps over many days of interaction with a user). As the conversation's length grows, the memory management agent will move information from the input context to a persistent searchable database; summarize information to keep relevant facts in the input context; and restore relevant conversation elements from further back in time. This allows a chatbot to keep what's currently most relevant in its input context memory to generate the next response. When I read the original MemGPT paper, I thought it was an innovative technique for handling memory for LLMs. The open-source Letta framework, which we'll use in this course, makes MemGPT easy to implement. It adds memory to your LLM agents and gives them transparent long-term memory.
In detail, you’ll learn:
- How to build an agent that can edit its own limited input context memory, using tools and multi-step reasoning
- What a memory hierarchy is (an idea from computer operating systems, which use a cache to speed up memory access), and how these ideas apply to managing the LLM input context (where the input context window is a "cache" storing the most relevant information, and an agent decides what to move between it and a larger persistent storage system)
- How to implement multi-agent collaboration by letting different agents share blocks of memory
This course will give you a sophisticated understanding of memory management for LLMs, which is important for chatbots having long conversations, and for complex agentic workflows. Please sign up here!
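The memory-hierarchy idea described above can be illustrated with a minimal sketch. It is not the Letta API and not the MemGPT implementation: the `TieredMemory` class and its fields are hypothetical, and a fixed oldest-first eviction rule stands in for the agent that, in MemGPT, decides what moves in and out of the context.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    """Toy two-tier memory: a small in-context buffer plus a searchable archive."""

    context_limit: int = 3  # hypothetical cap, counted in messages rather than tokens
    context: deque = field(default_factory=deque)  # plays the role of the "cache"
    archive: list = field(default_factory=list)    # plays the role of persistent storage

    def add(self, message: str) -> None:
        """Append a message, evicting the oldest ones to the archive when over the cap."""
        self.context.append(message)
        while len(self.context) > self.context_limit:
            self.archive.append(self.context.popleft())

    def recall(self, keyword: str) -> list:
        """Return archived messages containing the keyword (case-insensitive)."""
        return [m for m in self.archive if keyword.lower() in m.lower()]


memory = TieredMemory(context_limit=2)
for msg in ["My name is Ada.", "I like tea.", "What's the weather?"]:
    memory.add(msg)

print(list(memory.context))   # the two most recent messages stay in context
print(memory.recall("name"))  # the evicted message is still searchable
```

A keyword match keeps the example dependency-free. A practical system would more likely search the archive by embedding similarity and summarize evicted messages instead of dropping them from the context wholesale.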

Andrew Ng

200,788 views • 1 year ago

World Simulator, reimagined — now alive with humans, robots, and their vibrant society unfolding in 3D real-world geospatial scenes across the globe! 🚀 One day soon, humans and robots will co-exist in the same world. To prepare, we must address: 1️⃣ How can robots cooperate or compete intelligently? 2️⃣ How do humans build social bonds and communities? 3️⃣ How can both co-exist in an open, dynamic world? Announcing Virtual Community Project — a social-physical world simulator, where human characters and robotic agents can interact, grow, and co-evolve within open-world societies, stretching from London to New York, and beyond! Key features include: ✅ Unified multi-agent physics simulations for rich social + physical interactions of humans and robots ✅ Massive auto-generated 3D scenes grounded in real-world geospatial data ✅ Agent communities populated by robots and LLM-driven human characters with rich appearances, personalities, and social ties. 🌍 Enter our Virtual Community, an open world to study embodied AI at scale — one social-physical world model at a time! 🔗 Project: 💻 Code: Paper: 1/n

Chuang Gan

90,261 views • 1 year ago

Tracking Anything with Decoupled Video Segmentation paper page: Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation.

AK

305,667 views • 2 years ago

Introducing EdgeBench, a benchmark designed to study how agents learn from environments over runs of 12 to 72 hours. We find that performance follows a log-sigmoid function of environment interaction time with high precision. EdgeBench is built with three ingredients:
- 🌍 Real & Diverse: 134 real-world tasks across 6 task categories, spanning scientific problems, professional knowledge work, software engineering, optimization, formal math, and games.
- ⏳ Ultra-Long-Horizon: Each task supports 12–72 hours of agent work. Recorded human effort averages 57.2 hours.
- 🔁 Informative Feedback: Agents receive real-world feedback for continuous improvement.
After 38,000 hours of agent runs on EdgeBench, a scaling law for learning from environments emerges:
- 📈 As agents interact with task environments over time, their aggregate performance is precisely fit by a log-sigmoid function.
- 🧠 This phenomenon can be explained by an elegant theory of graph exploration.
We are releasing an initial 51 of the 134 tasks, together with the full evaluation framework, to help advance long-horizon agent research. Check our blog & paper for more findings! Blog Paper GitHub Dataset Details below 👇🧵

Deyao Zhu

356,465 views • 24 days ago

“xAI team is currently working heavily on coding models. Right now, the main focus is training a specialized coding model that will be both fast and smart. I believe we’ll share it with you guys in a few weeks. That’s exciting. Second, after coding, we all see that the main weakness of Grok 4 is its multimodal capabilities. In fact, it was so bad that Grok was effectively looking at the world while squinting through glass trying to see blurry features and make sense of them. The most immediate improvement we’re stepping on with the next generation pre-trained model is huge gains in image understanding, video understanding, and audio. Right now, the model can hear and see the world just like any of you. And with all the tools and other agents it can talk to, we’re going to see a huge unlock for many different application layers once multimodal agents arrive.” — xAI Team

Apurv Kochara

88,142 views • 5 months ago

Google presents AudioPaLM: A Large Language Model That Can Speak and Listen paper page: We introduce AudioPaLM, a large language model for speech understanding and generation. AudioPaLM fuses text-based and speech-based language models, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text large language models such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only large language model improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio language models, such as transferring a voice across languages based on a short spoken prompt.

AK

290,517 views • 3 years ago

Day 3/5 of #MiniMaxWeek: MiniMax Agent — Code is Cheap, Show Me the Requirement. Today, we’re officially launching MiniMax Agent: a general intelligent agent built to tackle long-horizon, complex tasks. From expert-level multi-step planning to flexible task breakdown and end-to-end execution — it’s designed to act like a reliable teammate, with strengths in:
- Programming & tool use
- Multimodal understanding & generation
- Seamless MCP integration
Already in internal use for 60 days, it’s become a daily tool for over 50% of our team. Here’s a classic saying: “Talk is cheap, show me the code.” But with intelligent agents, something shifts. Now we say: “Code is cheap, show me the requirement.” Try it now:

MiniMax (official)

459,894 views • 1 year ago

🦾 From seeing to doing. We're closing the loop between video prediction and real-world action. On the final day of Robbyant Open Source Week, we bring you LingBot-VA—the world's first causal video-action world model for generalist robot control. 🔥 Key Highlights: 🤖 Predicts & Acts: A single model generates both future video and the actions to make it real. 🧠 Remembers the Past: True long-term memory for complex, sequential tasks. ⚡ Learns in a Snap: Masters new skills with just 30-50 real-world demos. The result? 📈 SOTA on RoboTwin (92.9%) and LIBERO (98.5%), +20% over π0.5 on challenging real-world long-horizon & high-precision tasks. More below 👇 #AI #Robotics #EmbodiedAI #WorldModel #OpenSource

Robbyant

704,157 views • 5 months ago