Can AI agents adapt, zero-shot, to complex multi-step language instructions in open-ended environments? We present MaestroMotif, a method for AI-assisted skill design that produces highly capable and steerable hierarchical agents. To the best of our knowledge, it is the first method that, without expert-labeled datasets, solves compositional tasks...
80,217 views • 1 year ago • via X (Twitter)
11 comments

MaestroMotif builds on our previous work, Motif, which pioneered learning RL policies from AI feedback. At the time, it set a new state of the art on the open-ended domain of NetHack. With MaestroMotif, we improve on this performance by two orders of magnitude. But how are these gains obtained? In a couple of words: task decomposition.

MaestroMotif is a scalable and effective algorithm for AI-assisted skill design. It starts by leveraging an agent designer's prior knowledge about a domain: the designer defines a set of useful skills, or agents, described at a high level in natural language. MaestroMotif then converts these descriptions into reward models through AI feedback. These rewards encode a notion of good behaviour for each skill. MaestroMotif then plans, through in-context learning and unit-test feedback, a strategy for executing the skills in the environment. This strategy is instantiated as a code policy over skills.
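To make "a code policy over skills" concrete, here is a minimal sketch of what such a policy and its unit-test feedback could look like. It is an illustration only, not the code MaestroMotif generates: the skill names (`explore`, `descend`), the `Observation` fields, and the `check_policy` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Hypothetical, heavily simplified view of the game state."""
    depth: int            # current dungeon level
    stairs_visible: bool  # whether a downward staircase is in view


def select_skill(obs: Observation, target_depth: int) -> str:
    """Code policy over skills: pick which skill should act next."""
    if obs.depth >= target_depth:
        return "explore"   # target depth reached, keep exploring this level
    if obs.stairs_visible:
        return "descend"   # stairs in view, hand control to the descend skill
    return "explore"       # otherwise explore until stairs are found


def check_policy(policy) -> list:
    """Unit-test feedback: return a message for every failed case.

    In an LLM-in-the-loop setup, these messages could be fed back to the
    model so that it can revise the policy (assumed workflow).
    """
    cases = [
        (Observation(depth=1, stairs_visible=False), 3, "explore"),
        (Observation(depth=1, stairs_visible=True), 3, "descend"),
        (Observation(depth=3, stairs_visible=True), 3, "explore"),
    ]
    failures = []
    for obs, target, expected in cases:
        got = policy(obs, target)
        if got != expected:
            failures.append(f"{obs}, target={target}: expected {expected}, got {got}")
    return failures
```

An empty list from `check_policy(select_skill)` means every case passed. A different task would reuse the same skills under a different `select_skill` function.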

Once the skill policies are learned, MaestroMotif can adapt, zero-shot, to new instructions and solve complex tasks simply by re-combining skills, much like motifs in a composition. In other words, it writes a different code policy over skills that achieves a completely different task.

We highlight the complexity of some of these tasks, which on average take more than a thousand steps for completion. Even methods that are trained specifically for each task are not able to make any kind of progress.

Evaluation on such complex tasks is only possible thanks to the work of dedicated fans of NetHack, who have been building and upgrading the game since 1987 (the repository is still actively maintained). We show in this figure some of the complexities of NetHack. A few years back, AI researchers (@HeinrichKuttler, @egrefen and @_rockt, to name a few) foresaw the importance of such an environment and created @NetHack_LE, which allows for fast experimentation with RL agents in an incredibly complex environment.

The NetHack Learning Environment has also recently been used within the Balrog benchmark from @PaglieriDavide, @CupiaBart et al., which emphasizes how difficult it is for current LLMs to perform well on long-horizon tasks. In this benchmark, @NetHack_LE is undoubtedly the hardest domain. See this announcement:

An interesting discovery was that the learned skills naturally emerged in the form of a curriculum. To give more context, we used a single skill-conditioned neural network to learn all behaviours, and these behaviours were learned simultaneously. As a result, easier skills are the first to maximize their skill reward, paving the way for more complex skills to be learned. TL;DR: Hierarchy affords learnability.
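For readers unfamiliar with skill conditioning, the idea of one network serving every skill can be sketched as follows. This is a toy linear model under assumed names (`SKILLS`, `policy_input`, `action_scores`), not the architecture used in the work: the active skill is appended to the observation as a one-hot vector, so all skills share the same weights.

```python
# Hypothetical skill set; the names are for illustration only.
SKILLS = ["explore", "descend", "ascend"]


def one_hot(skill: str) -> list:
    """Encode the active skill as a one-hot vector over SKILLS."""
    return [1.0 if s == skill else 0.0 for s in SKILLS]


def policy_input(obs_features: list, skill: str) -> list:
    """Concatenate observation features with the skill encoding."""
    return list(obs_features) + one_hot(skill)


def action_scores(weights: list, obs_features: list, skill: str) -> list:
    """Toy linear 'network': one row of shared weights per action.

    The same weights are used for every skill; only the one-hot part of
    the input changes, which is what lets behaviours share what they learn.
    """
    x = policy_input(obs_features, skill)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
```

Because the weights are shared, progress on an easy skill also changes the features available to harder ones, which is one plausible reading of the curriculum effect described above.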

Finally, we analyze the choice of the LLM used to write code policies. We notice a scaling behaviour wherein only the largest open-source LLM of the time, Llama 3.1 405B, was able to define policies that were successful on all tasks. With the advent of thinking models, it would be interesting to investigate their ability to orchestrate skills through code.

@twimlai @TalkRLPodcast @DrJimFan @nathanbenaich @_akhaliq @arankomatsuzaki @Mila_Quebec @AmiiThinks @AIatMeta @ylecun


@_rockt Amazing Martin! Congratulations 🎉
