Can AI agents adapt, zero-shot, to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first method that, without expert-labeled datasets, solves compositional tasks... requiring hundreds of steps for completion.

All the modules within MaestroMotif are learned from interaction, from the highest level of planning to the lowest level of sensorimotor control. On the open-ended domain of NetHack, it surpasses existing approaches, including those that are fine-tuned specifically for each task.

At the heart of MaestroMotif is the idea that decomposing a task into subtasks significantly helps decision making. MaestroMotif leverages an agent designer's intuition about a domain to identify important skills and describe them in natural language. These short descriptions are then converted into adaptable hierarchical agents through AI feedback and in-context learning.

Our paper was recently published at ICLR 2025, and we open-source the whole project, including the code, prompts and pre-trained models. Paper: Code: NotebookLM Podcast:

This work was done with the amazing Mikael Henaff, Roberta Raileanu, Shagun Sodhani, Pascal Vincent, Amy Zhang, Pierre-Luc Bacon, Doina Precup, with equal supervision by Marlos C. Machado and Pierluca D'Oro. Take a look at the following thread:

Martin Klissarov

2,819 subscribers

80,217 views • 1 year ago • via X (Twitter)

11 comments

Martin Klissarov • 1 year ago

MaestroMotif builds on our previous work, Motif, which pioneered learning RL policies from AI feedback. At the time, it set a new state of the art on the open-ended domain of NetHack. With MaestroMotif, we improve on this performance by two orders of magnitude. But how are these gains obtained? In a couple of words: from task decomposition.

Martin Klissarov • 1 year ago

MaestroMotif is a scalable and effective algorithm for AI-assisted skill design. It starts by leveraging the prior knowledge of an agent designer, who defines a set of useful skills, or agents, for a domain. Agents/skills here are described at a high level in natural language. MaestroMotif then converts these descriptions into reward models through AI feedback. These rewards encode a notion of good behaviour for each of the skills. MaestroMotif then plans, through in-context learning and unit-test feedback, a strategy for executing the skills in the environment. This strategy is instantiated in the form of a code policy over skills.
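To make the "unit-test feedback" step more concrete, here is a minimal Python sketch of a propose-test-refine loop. It is an illustration under stated assumptions, not the released implementation: `run_unit_tests`, `refine_policy`, and `fake_llm` are hypothetical names, the policy is passed around as a Python callable rather than generated source code, and a scripted stand-in replaces the LLM.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical types: a policy over skills maps an observation to a skill name.
Policy = Callable[[dict], str]
UnitTest = Tuple[dict, str]  # (observation, expected skill name)


def run_unit_tests(policy: Policy, tests: List[UnitTest]) -> List[str]:
    """Return one readable failure message per failing test."""
    failures = []
    for obs, expected in tests:
        try:
            got = policy(obs)
        except Exception as exc:  # a crashing policy also counts as a failure
            failures.append(f"obs={obs}: raised {exc!r}")
            continue
        if got != expected:
            failures.append(f"obs={obs}: expected {expected!r}, got {got!r}")
    return failures


def refine_policy(propose: Callable[[List[str]], Policy],
                  tests: List[UnitTest],
                  max_rounds: int = 5) -> Optional[Policy]:
    """Ask `propose` for a policy, feeding failures back until all tests pass."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        policy = propose(feedback)
        feedback = run_unit_tests(policy, tests)
        if not feedback:
            return policy
    return None  # no passing policy within the round budget


# Scripted stand-in for the LLM: the first draft ignores hunger,
# the revision (issued once feedback exists) handles it.
def fake_llm(feedback: List[str]) -> Policy:
    if not feedback:
        return lambda obs: "explore"
    return lambda obs: "eat" if obs["hungry"] else "explore"


tests = [({"hungry": False}, "explore"), ({"hungry": True}, "eat")]
policy = refine_policy(fake_llm, tests)
```

In this toy run the first draft fails the hunger test, the failure message is passed back, and the second draft passes.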

Martin Klissarov • 1 year ago

Once the skill policies are learned, MaestroMotif can adapt, zero-shot, to new instructions and solve complex tasks simply by re-combining skills, similarly to motifs in a composition. In other words, it writes a different code policy over skills which achieves a completely different task.
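A minimal sketch of what recombining skills through a code policy could look like. The skill names, observation fields, and both policy functions are hypothetical illustrations, not the prompts or code released with the paper. The point is only that the same fixed set of skills can serve two different instructions by swapping the small piece of code that selects among them.

```python
from typing import Dict

# Hypothetical skill names; the skills used in the paper may differ.
SKILLS = ("explore", "descend", "ascend", "fight")


def policy_reach_depth(obs: Dict[str, int], target_depth: int = 3) -> str:
    """Code policy for an instruction like 'go down to dungeon level 3'."""
    if obs["monster_adjacent"]:
        return "fight"
    if obs["depth"] < target_depth:
        # Take the stairs once they are known, otherwise keep looking for them.
        return "descend" if obs["stairs_down_seen"] else "explore"
    return "explore"


def policy_return_to_surface(obs: Dict[str, int]) -> str:
    """Code policy for a different instruction: 'go back up to level 1'."""
    if obs["monster_adjacent"]:
        return "fight"
    if obs["depth"] > 1:
        return "ascend" if obs["stairs_up_seen"] else "explore"
    return "explore"


obs = {"depth": 2, "monster_adjacent": 0,
       "stairs_down_seen": 1, "stairs_up_seen": 1}
skill_for_descent = policy_reach_depth(obs)
skill_for_ascent = policy_return_to_surface(obs)
```

Given the same observation, the two policies pick different skills because they encode different tasks. The low-level skill policies themselves would not need retraining.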

Martin Klissarov • 1 year ago

We highlight the complexity of some of these tasks, which on average take more than a thousand steps for completion. Even methods that are trained specifically for each task are not able to make any kind of progress.

Martin Klissarov • 1 year ago

Evaluation on such complex tasks is only possible thanks to the work of dedicated fans of NetHack, who have been building and upgrading the game since 1987 (the repository is still actively maintained). We show in this figure some of the complexities of NetHack. A few years back, AI researchers (@HeinrichKuttler, @egrefen and @_rockt, to name a few) foresaw the importance of such an environment and created the @NetHack_LE, which allows for fast experimentation with RL agents in an incredibly complex environment.

Martin Klissarov • 1 year ago

has also recently been used within the Balrog benchmark (from @PaglieriDavide, @CupiaBart et al.), which emphasizes how difficult it is for current LLMs to perform well on long-horizon tasks. In this benchmark, @NetHack_LE is undoubtedly the hardest domain. See this announcement:

Martin Klissarov • 1 year ago

An interesting discovery was that the learned skills naturally emerged in the form of a curriculum. To give more context, we used a single skill-conditioned neural network to learn all behaviours, and these behaviours were learned simultaneously. As a result, easier skills are the first to maximize their skill reward, paving the way for more complex skills to be learned. TL;DR: Hierarchy affords learnability.
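As a rough illustration of what "a single skill-conditioned network" means, the sketch below shares one set of parameters across all skills and feeds the active skill in as a one-hot vector next to the observation. It is a toy under loud assumptions: a single linear layer with a softmax stands in for the neural network, and the skill list, observation size, and action count are all made up.

```python
import math
import random
from typing import List

SKILLS = ["explore", "descend", "fight"]  # hypothetical skill set
OBS_DIM = 4                               # hypothetical observation size
NUM_ACTIONS = 3                           # hypothetical number of actions


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


class SkillConditionedPolicy:
    """One shared parameter matrix; the skill is just part of the input."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = OBS_DIM + len(SKILLS)
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
                        for _ in range(NUM_ACTIONS)]

    def action_probs(self, obs: List[float], skill: str) -> List[float]:
        # Concatenate the observation with the one-hot skill encoding.
        x = list(obs) + one_hot(SKILLS.index(skill), len(SKILLS))
        logits = [sum(w * v for w, v in zip(row, x)) for row in self.weights]
        # Numerically stable softmax over actions.
        top = max(logits)
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        return [e / total for e in exps]


policy = SkillConditionedPolicy(seed=0)
obs = [0.5, -1.0, 0.0, 2.0]
probs_explore = policy.action_probs(obs, "explore")
probs_fight = policy.action_probs(obs, "fight")
```

Because every skill trains the same weights, whatever the easier skills learn about the observation is available to the harder ones, which is one plausible reading of the curriculum effect described above.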

Martin Klissarov • 1 year ago

Finally, we analyze the choice of the LLM used to write code policies. We notice a scaling behaviour wherein only the largest open-source LLM of the time, Llama 3.1 405B, was able to define policies that were successful on all tasks. With the advent of thinking models, it would be interesting to investigate their ability to orchestrate skills through code.

Martin Klissarov • 1 year ago

@twimlai @TalkRLPodcast @DrJimFan @nathanbenaich @_akhaliq @arankomatsuzaki @Mila_Quebec @AmiiThinks @AIatMeta @ylecun

Wannan (Winnie) Yang 🧠🤖 • 1 year ago

@_rockt Amazing Martin! Congratulations 🎉

Related videos

Can reinforcement learning from AI feedback unlock new capabilities in AI agents? Introducing Motif, an LLM-powered method for intrinsic motivation from AI feedback. Motif extracts reward functions from Llama 2's preferences and uses them to train agents with reinforcement learning. On the complex NetHack game, Motif solves previously unsolved tasks without needing any expert demonstrations. Surprisingly, Motif's reward leads to a better game score than the one obtained by using the score itself as a reward.

Given access to an event captioning mechanism, a few properties make Motif a general method:
• it is entirely based on open models
• the LLM doesn't need direct access to the environment dynamics (e.g., its source code)
• the LLM doesn't need to understand observation and action spaces

The best part? You can start using Motif right now, even on a small compute budget: the whole pipeline can take less than two GPU-days. Feel free to read our paper and try our code out. Paper: Code: Blog post:

Work co-led by Martin Klissarov and myself, with Shagun Sodhani, Roberta Raileanu, Pierre-Luc Bacon, Pascal Vincent, Amy Zhang, Mikael Henaff. Learn more in the thread 🧵
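One common way to turn pairwise preferences into a reward function is to fit a Bradley-Terry model, where the probability that one item is preferred over another is a sigmoid of their reward difference. The sketch below does that with a linear reward and plain gradient ascent. The post does not spell out the training objective, so the Bradley-Terry choice, the two toy features, and the hand-written preference pairs (standing in for LLM annotations) are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

Features = List[float]
# Each pair is (preferred, rejected), as an annotator such as an LLM might label it.
Preference = Tuple[Features, Features]


def reward(w: List[float], x: Features) -> float:
    """Linear reward model: r(x) = w . x"""
    return sum(wi * xi for wi, xi in zip(w, x))


def fit_reward(pairs: List[Preference], dim: int,
               lr: float = 0.5, epochs: int = 200) -> List[float]:
    """Maximise the Bradley-Terry log-likelihood of the preference pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for good, bad in pairs:
            diff = reward(w, good) - reward(w, bad)
            # Bradley-Terry: P(good preferred over bad) = sigmoid(diff)
            p = 1.0 / (1.0 + math.exp(-diff))
            # Gradient of log P with respect to w is (1 - p) * (good - bad).
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (good[i] - bad[i])
    return w


# Toy features: [found_gold, took_damage]. Labels are hand-written stand-ins.
pairs = [
    ([1.0, 0.0], [0.0, 0.0]),  # finding gold beats nothing happening
    ([0.0, 0.0], [0.0, 1.0]),  # nothing happening beats taking damage
    ([1.0, 1.0], [0.0, 1.0]),  # gold is still better when damage is equal
]
w = fit_reward(pairs, dim=2)
```

The fitted weights assign positive reward to the first feature and negative reward to the second, so the resulting scalar can be handed to a standard RL algorithm as a reward signal.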

Pierluca D'Oro

311,883 views • 2 years ago

LARP: Language-Agent Role Play for Open-World Games paper page: Language agents have shown impressive problem-solving skills within defined settings and brief timelines. Yet, with the ever-evolving complexities of open-world simulations, there's a pressing need for agents that can flexibly adapt to complex environments and consistently maintain a long-term memory to ensure coherent actions. To bridge the gap between language agents and open-world games, we introduce Language Agent for Role-Playing (LARP), which includes a cognitive architecture that encompasses memory processing and a decision-making assistant, an environment interaction module with a feedback-driven learnable action space, and a postprocessing method that promotes the alignment of various personalities. The LARP framework refines interactions between users and agents, predefined with unique backgrounds and personalities, ultimately enhancing the gaming experience in open-world contexts. Furthermore, it highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios.

AK

143,974 views • 2 years ago

Today we’re launching the first and only human-like AI agents in the world. Super Agents™ are the first agents with human‑level skills – they DM you, take @ mentions, send emails, manage docs, tasks, and more. Not just tools or API calls, but real skills fine‑tuned for how teams actually work. The first agents with 100% context – fully native in ClickUp and fully synced from other apps. Super Agents see your work the same way that humans do: tasks, docs, schedules, and conversations all in one place. The first agents that learn from human interactions automatically, without any setup or configuration – when you give feedback, they listen and improve how they work. The first agents with human‑level memory for custom agents – historical memory for every interaction, short-term working memory, and even long‑term memory stored in docs you can literally open, inspect, and edit. The first agents that are literally the same as users – our agentic user model is the same as our user data model. This gives you permissions and capabilities that you and your systems are already familiar with. The first infinite agent catalog – where anyone can create and customize agents in minutes, for literally any type of work imaginable. It's the most intuitive way to build agents on the planet. 95% of companies are failing in AI adoption. The reality is that AI isn't meant to be adopted, it's meant to be adapted – to you. Super Agents are automatically personalized to you and your company using proprietary state-of-the-art agent architecture, orchestration, and tooling. Today is the largest step forward we've ever made towards our mission of making people more productive. Maximize human productivity, with ClickUp Super Agents. Available NOW. For everyone.

Zeb Evans

320,417 views • 6 months ago

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models paper page: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

AK

141,425 views • 2 years ago

Interestingly, as we have AI agents that run in the background, the speed of AI becomes incrementally less important than the core underlying capability level. When you could only give AI small bits of work, then the speed of response mattered a ton. The rate at which you can go back and forth with AI in real-time was the determining factor on how useful it is. But as AI agents can perform more complex and useful tasks in parallel behind the scenes, you can much more easily afford to wait longer for work to get done assuming it’s valuable. Now the focus becomes more on how you review and orchestrate the agents’ work. And the main factor is just how useful and usable is the output that the agent came back with.

Aaron Levie

103,811 views • 9 months ago

Thrilled to see Amazon Web Services making a major contribution to the open source AI community with the launch of the Strands Agents, an open source AI agents SDK! The core of Strands is the simple agentic loop that connects the model and tools together, like the two strands of DNA. This model-driven approach to agent building eliminates the need for complex agent orchestration by embracing the capabilities of state-of-the-art models to plan, chain thoughts, call tools, and reflect. Providing open source tools and interoperability with open source protocols is an important part of our strategy to enable an agentic future. Can't wait to see what you build with Strands!

Swami Sivasubramanian

32,185 views • 1 year ago

Aaron Levie says AI agents may take years to diffuse from engineering into the rest of knowledge work. "The models are really good at code, and the work is verifiable." "You have 5 or 10 things that make agents work in an enterprise context for engineering." "The users are less technical, the data is much more fragmented, the systems are much more legacy." "That's why it's going to be a number of years for this diffusion to roll from what we're seeing in Silicon Valley, into the rest of knowledge work."

MTS

13,046 views • 2 months ago

NVIDIA CEO Jensen Huang just gave us a glimpse into the future of AI, and it's far bigger than text and images. He asked: "If you can understand worlds, is it possible that you understand proteins and chemicals that have structure?" This is the quiet revolution happening in AI right now. We're moving from an AI that understands human language to an AI that understands the language of nature itself. The "letters" are atoms. The "words" are molecules. The "sentences" are the complex structures of life, networks, and reality. The same foundational models that generate video are now being used to:
* Decode proteins to fight disease
* Design new materials at the atomic level
* Simulate complex systems in quantum computing
These were once the domain of supercomputers and Nobel laureates. Now, they're becoming solvable problems for a new generation of startups. AI is transitioning from a tool for communication to a tool for creation and discovery in the physical world. This is where the next wave of generational companies will be built. The "hard-to-solve" problems are now within reach.

Konstantine Buhler

13,014 views • 8 months ago

As AI agents face increasingly long and complex tasks, decomposing them into subtasks becomes increasingly appealing. But how do we discover such temporal structure? Hierarchical RL provides a natural formalism, yet many questions remain open. Here's our overview of the field 🧵

Martin Klissarov

36,008 views • 1 year ago

AI AGENTS 101 (58 minute free masterclass)

send this to anyone who wants to understand ai agents, claude skills, md files, how to get the most out of AI etc in plain english:

1. chat vs agents - chat models answer questions in a back and forth while agents take a goal, figure out the steps, and deliver a result
2. agents don’t stop after one response. they keep running until the task is actually finished. no babysitting required
3. everything runs on a loop. they gather context, decide what to do, take an action, then repeat until done
4. the loop is the system. they look at files, tools, and the internet. decide the next step. execute and then feed that back into the next step. over and over until completion
5. the model is just one piece. gpt, claude, gemini are the reasoning layer. the key is model + loop + tools + context
6. mcp is how agents use tools. it connects things like browser, code, apis, and your internal software. once connected, the agent decides when to use them to get the job done
7. context beats prompt all day. you don't need to write perfect prompts. load your agent with context about your business, style, and goals and then simple instructions work
8. claude.md or agents.md is the onboarding doc. it tells the agent who it is, how to behave, what it knows, and what tools it can use. this gets loaded every time before it starts
9. memory.md is how it improves. agents don’t remember by default. this file stores preferences, corrections, and patterns. you tell the agent to update it, and it gets better over time
10. skills + harnesses make it usable. skills are reusable tasks like writing, research, analysis. the harness is the environment like claude code or openclaw that runs everything. basically, different interfaces, same system underneath

this episode with remy on The Startup Ideas Podcast (SIP) 🧃 was one of the clearest ways of understanding a lot of the core concepts of ai agents

could be the best beginners course for ai agents

58 mins. all free. no advertisers. i just want to see you build cool stuff. im rooting for you. send to a friend

watch
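The loop from points 3 and 4 above (gather context, decide, act, feed the result back, repeat until done) can be sketched in a few lines of Python. This is a toy: `run_agent` and `scripted_decide` are hypothetical names, the decision function is a hard-coded script standing in for a model call, and the only tool is an adder.

```python
from typing import Callable, Dict, List, Tuple

Tool = Callable[[str], str]
# The decision function sees the context so far and returns (tool name, argument).
Decide = Callable[[List[str]], Tuple[str, str]]


def run_agent(goal: str, decide: Decide, tools: Dict[str, Tool],
              max_steps: int = 10) -> List[str]:
    """Run the decide/act loop until the decision function says 'done'."""
    context = [f"goal: {goal}"]
    for _ in range(max_steps):
        action, arg = decide(context)      # decide what to do next
        if action == "done":
            break
        result = tools[action](arg)        # take the action
        context.append(f"{action}({arg}) -> {result}")  # feed the result back
    return context


# Scripted stand-in for the model: call the adder once, then stop.
def scripted_decide(context: List[str]) -> Tuple[str, str]:
    if len(context) == 1:
        return ("add", "2 3")
    return ("done", "")


tools = {"add": lambda arg: str(sum(int(token) for token in arg.split()))}
history = run_agent("add 2 and 3", scripted_decide, tools)
```

The `max_steps` cap is a design choice worth keeping in any real version, since a decision function that never returns "done" would otherwise loop forever.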

GREG ISENBERG

375,319 views • 3 months ago

We just open-sourced Simulang, a JavaScript library that gives you and your coding agents the ability to fully control computers. Simulang is our scripting framework for desktop automation. It allows you to write deterministic computer use code that can be replayed dozens if not hundreds of times. Much faster and cheaper, compared to LLM agents. It ships with a skill out of the box. That means you can turn natural language prompts into Simulang code directly, and your agent can now see the screen, click, type, and navigate. It gives your AI agents eyes and hands to control the computer like humans do. Try these things with Simulang: - upload your pile of invoices to the expense platform - find 50 people with specific criteria on LinkedIn - open your calculator app and do some math - find all buttons on your favorite app using accessibility tree - or draw a bunny with MS paint 🐰 In this demo, we show how Simulang took over when the coding agent couldn’t open and click on the target LinkedIn profile page. Give it a try:

Simular

14,218 views • 1 month ago

Earlier this week at GTC, we announced our partnership with Nvidia. We will work with Nvidia to build strong, American open-source models that are at the frontier of scientific reasoning. These models will be essential for the US to compete with China on science in the coming decades. Jensen is committing to spend tens of billions of dollars developing open-source models, and we are excited to be a partner with them in figuring out how to benchmark, train and use those agents to accelerate scientific research. We have already open-sourced some of the work we have done with them, and are looking forward to open-sourcing more. There are few things today that are more important. See our blog post below, and watch the video to learn more, narrated by the man himself.

Sam Rodriques

23,200 views • 3 months ago

Everyone wants agent swarms. Very few people are talking seriously enough about the context layer that makes swarms useful. Even with one agent, context is fragile. Too little context and the agent guesses. Too much context and it wastes tokens, loses focus, or reasons over irrelevant noise. The sweet spot is precise context: the right knowledge, in the right structure, at the right moment. With many agents, that challenge explodes. Each agent produces decisions, assumptions, findings, summaries, risks, and partial conclusions. Unless that knowledge becomes shared, structured, and reusable, every new agent is forced to rediscover what another agent already learned. That is not a swarm. That is a crowd. Shared context graphs are what turn agent activity into agent collaboration, and OriginTrail DKG V10 brings them to life. Was just playing with some final polishing for the V10 release, and it is really powerful to see shared context graphs where multiple agents contribute knowledge into the same connected memory, with attribution visible directly in the graph ui. That matters for three reasons. First, agents can access and build on one shared memory instead of staying trapped in isolated sessions. Second, the graph structure helps them retrieve the exact context they need, instead of stuffing everything into a prompt and hoping the model sorts it out. Third, verifiability of provenance. You can see which agent contributed each piece of knowledge, trace the source, and decide what to trust. Tokenmaxxing starts with fewer tokens, but the deeper story is coordination - agents stop reloading the world and start building on shared, verifiable context. That is the foundation for serious multi-agent work across software engineering, research, finance, operations, project management, and far beyond. The future is not more agents, it is agents working from shared, verifiable context. But the more the merrier, of course.

Jurij Skornik

11,070 views • 1 month ago

VITA: Towards Open-Source Interactive Omni Multimodal LLM discuss:

The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary followed by bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities of multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLM. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While there is still a lot of work to be done on VITA to get close to its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research.

AK

23,958 views • 1 year ago

Preview of our DeepResearchSwarm™. It's incomprehensibly fast because it uses a multi-agent architecture instead of a single agent. It uses a joint hierarchical-parallel architecture that first creates a plan and then executes that plan concurrently with multiple search agents. Workflow: Task → Research Director → multiple search agents → final summary by summarization agent. Stay tuned for the release ↓
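The workflow above (plan first, then fan out to parallel search agents, then summarize) can be sketched with `asyncio`. This is an illustrative sketch only, not the DeepResearchSwarm implementation: the functions `research_director`, `search_agent`, `summarization_agent`, and `run_swarm` are hypothetical stand-ins, and the LLM and search calls are replaced with stubs.

```python
import asyncio


async def research_director(task: str) -> list[str]:
    # Stand-in planner: a real director would ask an LLM to split the task.
    return [f"{task}: background", f"{task}: recent results", f"{task}: open problems"]


async def search_agent(query: str) -> str:
    # Stand-in search: the sleep simulates network latency.
    await asyncio.sleep(0.01)
    return f"findings for '{query}'"


async def summarization_agent(findings: list[str]) -> str:
    # Stand-in summarizer: a real one would condense the findings with an LLM.
    return " | ".join(findings)


async def run_swarm(task: str) -> str:
    plan = await research_director(task)
    # Hierarchical step done; now run every search agent concurrently.
    findings = await asyncio.gather(*(search_agent(q) for q in plan))
    return await summarization_agent(list(findings))


if __name__ == "__main__":
    print(asyncio.run(run_swarm("protein folding")))
```

Because the search agents run concurrently under `asyncio.gather`, total latency is roughly that of the slowest search rather than the sum of all searches, which is where the speedup over a single sequential agent would come from.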

swarms

17,729 views • 1 year ago

As AI agents get better at computer and tool use, or at writing code on the fly for a task, we're going to be able to solve much broader domains of knowledge work. Here's an example of Box AI with the new Claude Skills generating a clean PowerPoint file from existing data.

Aaron Levie

30,235 views • 8 months ago

The same kinds of productivity gains we've seen in coding with AI agents are heading to the rest of knowledge work. This is the jump when you go from having a chatbot to being able to actually have an agent go off and do work for minutes or even hours and come back with a complete work output that you then review.

Here's an example of the new Box Agent filling out an RFP response from an existing knowledge base. This process would normally take hours and require the full attention of the user doing the work. Now, you provide the Box Agent with the RFP questions, and it will go off, make a plan, extract all the relevant questions, read through existing source material to come up with an answer, and then generate a new Word document as the final output. All while you're doing something else.

The key to this architecture is that the agent is able to use all of the same tools in the background that a user uses to get work done. The agent can search for documents, read entire files, run scripts and tools in the background, and even write code on the fly to automate tasks it hasn't seen before. And best of all, the Box Agent will (soon) work from the Box MCP and CLI so you can invoke it in any agentic system as a step in a process.

This kind of agent complexity would have been impossible even 6 months ago. Models consistently failed at tracking long-running tasks or using the right tools at the right moment for the task. But this is all now possible because of models like GPT-5.4, Opus 4.6, and Gemini 3, and is only getting better by the month.

Just as we moved from engineers writing code and using AI as an assistant to answer questions, in many areas of knowledge work (legal, finance, consulting, sales, marketing, and more), when we have a problem we'll just kick off the AI agent to go work on it for us in the background.

Aaron Levie

24,618 views • 2 months ago

Agent Trace: Capturing the Context Graph of Code

We are delighted to collaborate with Cursor, OpenCode, Vercel, Jules, Amp, Cloudflare, and Sasha Varlamov on an open standard for mapping back code:context. Here's how we see the potential of code context graphs and the new era of better tooling and better agents it enables.

(Yes, the following is vibe-videoed with Remotion's Skill and Windsurf (retired), 100% AI edits incl. audio.)

Cognition

39,718 views • 5 months ago

Imagine if your way of thinking - your edge, your taste, your strategy - could be turned into a high-performance worker. Not a copy of you. Something better. An agent that acts on your judgment at scale, powered by superintelligent systems and refined through real-world results. That’s what Fraction AI makes possible. It launches today on Base mainnet.

The core idea is simple: You create AI agents based on your own way of approaching problems. These agents compete on live tasks - writing, coding, finance, whatever - get feedback, learn from their performance, and improve over time. The better they get, the more they win. And so do you. No code required. Just your insight.

Why now?
Until now, building agents like this took huge teams and even bigger budgets. But with Fraction, anyone can do it. You can test ideas instantly. You can iterate fast. You can build a fleet of smart workers that evolve through competition. And it works: 30M+ sessions on testnet, 320K users, 1.2M agents already competing.

How it works?
Agents join sessions within a Space - a domain like finance, writing, or games. Each session runs as a series of competitive rounds. In every round, agents try to generate the best solution to a task. Their outputs are scored by a decentralized network of AI judges trained to evaluate quality for that domain. The top agents in each round earn rewards from the pooled entry fees. The losers get to learn. Feedback from each round helps them adjust and improve, and every session becomes a training loop.

What it means?
Fraction is a decentralized intelligence economy - a system where your ideas become agents, and agents earn by proving they work. You don’t need credentials or code. Just a clear point of view. If your thinking holds up under pressure, your agents will rise. This kind of AI used to live in corporate labs, built by PhDs with massive compute. Now anyone with a smart idea and an internet connection can build agents that compete, learn, and earn on their behalf.
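The round mechanics described above (agents submit outputs, judges score them, top agents split the pooled entry fees, the rest receive feedback) can be sketched as a single function. This is a toy model under stated assumptions, not Fraction AI's actual protocol: `run_round` is a hypothetical name, judge scores are simply averaged, and the pool is split equally among winners.

```python
def run_round(outputs, judges, entry_fee, winners=1):
    """Simulate one competitive round.

    outputs:   dict mapping agent name -> submitted output (str)
    judges:    list of callables mapping an output to a numeric score
    entry_fee: fee each agent pays into the reward pool
    winners:   how many top-ranked agents share the pool (assumption)
    """
    # Assumption: an agent's score is the mean of all judge scores.
    scores = {
        agent: sum(judge(out) for judge in judges) / len(judges)
        for agent, out in outputs.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[:winners]

    # Assumption: the pooled entry fees are split equally among the winners.
    pool = entry_fee * len(outputs)
    rewards = {agent: (pool / len(top) if agent in top else 0.0) for agent in ranked}

    # Losing agents get their score back as feedback for the next round.
    feedback = {agent: scores[agent] for agent in ranked if agent not in top}
    return scores, rewards, feedback


if __name__ == "__main__":
    outputs = {"agent-1": "short", "agent-2": "a much longer answer"}
    length_judge = lambda text: len(text)  # toy judge: longer output scores higher
    print(run_round(outputs, [length_judge], entry_fee=1.0))
```

Running rounds repeatedly and feeding `feedback` into each losing agent's next attempt gives the training loop the post describes.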

Fraction AI

67,748 views • 1 year ago