Can AI agents adapt, zero-shot, to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first method that, without expert-labeled datasets, solves compositional tasks...

80,217 views • 1 year ago • via X (Twitter)

11 comments

Martin Klissarov • 1 year ago

MaestroMotif builds on our previous work, Motif, which pioneered learning RL policies from AI feedback. At the time, it set a new state of the art on the open-ended domain of NetHack. With MaestroMotif, we improve on this performance by two orders of magnitude. But how are these gains obtained? In a few words: from task decomposition.

Martin Klissarov • 1 year ago

MaestroMotif is a scalable and effective algorithm for AI-assisted skill design. It starts by leveraging the prior knowledge of an agent designer, who defines a set of useful skills, or agents, for a domain. Agents/skills here are described at a high level in natural language. MaestroMotif then converts these descriptions into reward models through the process of AI feedback. These rewards encode a notion of good behaviour for each of the skills. MaestroMotif then plans, through in-context learning and unit-test feedback, a strategy for executing the skills in the environment. This strategy is instantiated in the form of a code policy over skills.
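
To make "a code policy over skills" with "unit-test feedback" a bit more concrete, here is a minimal sketch. It is not the code from the paper: the skill names, the `State` fields, and the test cases are all hypothetical placeholders chosen for illustration. The idea shown is only that the policy is a small program that picks which skill to run next, and that a test harness can return failure messages that could be fed back to whoever (or whatever) wrote the policy.

```python
from dataclasses import dataclass

# Hypothetical skill names; the thread does not list the actual skills.
SKILLS = ("explore", "descend", "fight")


@dataclass
class State:
    """Toy summary of the agent's situation (assumed fields, for illustration)."""
    monster_nearby: bool
    stairs_visible: bool


def policy_over_skills(state: State) -> str:
    """Return the name of the skill to run next."""
    if state.monster_nearby:
        return "fight"
    if state.stairs_visible:
        return "descend"
    return "explore"


def run_unit_tests(policy) -> list[str]:
    """Return failure messages; an empty list means the policy passed."""
    cases = [
        (State(monster_nearby=True, stairs_visible=True), "fight"),
        (State(monster_nearby=False, stairs_visible=True), "descend"),
        (State(monster_nearby=False, stairs_visible=False), "explore"),
    ]
    failures = []
    for state, expected in cases:
        got = policy(state)
        if got not in SKILLS:
            failures.append(f"{state}: unknown skill {got!r}")
        elif got != expected:
            failures.append(f"{state}: expected {expected!r}, got {got!r}")
    return failures
```

Calling `run_unit_tests(policy_over_skills)` returns an empty list here, while a policy that always answers `"explore"` would come back with two failure messages that could be used to revise it.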

Martin Klissarov • 1 year ago

Once the skill policies are learned, MaestroMotif can adapt, zero-shot, to new instructions and solve complex tasks simply by re-combining skills, similarly to motifs in a composition. In other words, it writes a different code policy over skills which achieves a completely different task.
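
A toy illustration of this re-combination, under heavy assumptions: the three "skills" below are plain functions that edit a small dictionary, standing in for learned skill policies, and the two tasks are invented for the example. The point is only that the skills stay fixed while a different, short policy over them produces a completely different outcome.

```python
# Toy stand-ins for learned skills: each one just edits a small state dict.
def descend(state):
    return {**state, "depth": state["depth"] + 1}


def ascend(state):
    return {**state, "depth": max(1, state["depth"] - 1)}


def collect(state):
    return {**state, "gold": state["gold"] + 1}


SKILLS = {"descend": descend, "ascend": ascend, "collect": collect}


def reach_depth_three(state):
    """Hypothetical task 1: go down until depth 3, then stop (None)."""
    return "descend" if state["depth"] < 3 else None


def collect_then_return(state):
    """Hypothetical task 2: gather 2 gold at depth 2, then come back up."""
    if state["gold"] < 2:
        return "descend" if state["depth"] < 2 else "collect"
    return "ascend" if state["depth"] > 1 else None


def run(policy, state, max_steps=20):
    """Repeatedly ask the policy for a skill name and apply that skill."""
    for _ in range(max_steps):
        name = policy(state)
        if name is None:
            break
        state = SKILLS[name](state)
    return state
```

Starting from `{"depth": 1, "gold": 0}`, `run(reach_depth_three, ...)` ends at depth 3 with no gold, while `run(collect_then_return, ...)` ends back at depth 1 with 2 gold, using the same `SKILLS` table.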

Martin Klissarov • 1 year ago

We highlight the complexity of some of these tasks, which on average take more than a thousand steps for completion. Even methods that are trained specifically for each task are not able to make any kind of progress.

Martin Klissarov • 1 year ago

Evaluation on such complex tasks is only possible thanks to the work of dedicated fans of NetHack, who have been building and upgrading the game since 1987 (it is still an ongoing, maintained repository). We show in this figure some of the complexities of NetHack. A few years back, AI researchers (@HeinrichKuttler, @egrefen and @_rockt, to name a few) foresaw the importance of such an environment and created the @NetHack_LE, which allows for fast experimentation with RL agents in an incredibly complex environment.

Martin Klissarov • 1 year ago

@NetHack_LE has also recently been used within the Balrog benchmark (from @PaglieriDavide, @CupiaBart et al.), which emphasizes how difficult it is for current LLMs to perform well on long-horizon tasks. In this benchmark, @NetHack_LE is undoubtedly the hardest domain. See this announcement:

Martin Klissarov • 1 year ago

An interesting discovery we came across was that the learned skills naturally emerged in the form of a curriculum. To give more context, we used a single skill-conditioned neural network to learn all behaviours, and these behaviours were learned simultaneously. As a result, easier skills are the first to maximize their skill reward, paving the way for more complex skills to be learned. TL;DR: Hierarchy affords learnability.
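
For readers unfamiliar with the term, "skill-conditioned" roughly means that one set of parameters serves every skill, with the active skill supplied as an extra input. The sketch below shows only that input convention. It is an assumption-laden toy: a single random linear layer stands in for the network, the skill names and dimensions are made up, and no training is shown.

```python
import random

SKILLS = ("explore", "descend", "fight")  # hypothetical skill names
OBS_DIM, NUM_ACTIONS = 4, 3               # assumed toy sizes


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


class SkillConditionedPolicy:
    """One shared weight matrix; the skill is appended to the observation."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        in_dim = OBS_DIM + len(SKILLS)
        # A single linear layer stands in for the neural network.
        self.weights = [
            [rng.uniform(-1, 1) for _ in range(in_dim)]
            for _ in range(NUM_ACTIONS)
        ]

    def logits(self, obs, skill):
        # Same parameters for every skill; only the one-hot part of the input changes.
        x = list(obs) + one_hot(SKILLS.index(skill), len(SKILLS))
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]

    def act(self, obs, skill):
        scores = self.logits(obs, skill)
        return max(range(NUM_ACTIONS), key=scores.__getitem__)
```

Because the weights are shared, whatever the network learns for an easy skill is available when a harder skill is being learned, which is one plausible reading of why a curriculum could emerge.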

Martin Klissarov • 1 year ago

Finally, we analyze the choice of the LLM used to write code policies. We notice a scaling behaviour wherein only the largest open-source LLM of the time, Llama 3.1 405B, was able to define policies that were successful on all tasks. With the advent of thinking models, it would be interesting to investigate their ability to orchestrate skills through code.

Martin Klissarov • 1 year ago

@twimlai @TalkRLPodcast @DrJimFan @nathanbenaich @_akhaliq @arankomatsuzaki @Mila_Quebec @AmiiThinks @AIatMeta @ylecun

Wannan (Winnie) Yang 🧠🤖 • 1 year ago

@_rockt Amazing Martin! Congratulations 🎉
