Behavioral Foundation Models (BFMs) trained with RL are secretly more powerful than we think. BFMs directly output a policy believed to be near-optimal given any reward function. Our new work shows that they can actually do much better:

2. BFMs learn generalizable representations that let an embodied agent act near-optimally for any reward function by providing a mapping from reward to the corresponding near-optimal policy. They do this with unsupervised pretraining algorithms such as PSM, FB, and HILP.

3. However, unlike language and vision, RL has mostly operated in a tabula rasa fashion. We don’t have RL pretraining methods that can be fine-tuned rapidly for any task. Most RL methods unlearn when we start fine-tuning, an issue often attributed to miscalibration of the value function.

4. Our first finding is striking: In the space of learned behaviors, the unsupervised RL pretraining based on successor features discovers behaviors that are much better than the ones that are output zero-shot. Below is a thorough evaluation on a number of environments and tasks:

5. Based on these findings, we present ways to rapidly fine-tune the zero-shot policy output by BFMs to improve performance on any downstream task. The algorithms are general, simple, task-agnostic, and performant. The key idea is: Search in the latent space of behaviors.

6. Our proposed algorithms can adapt in tens of episodes to achieve much better behaviors. Below is an example of search in the latent space of behaviors, showing how the policy evolves during adaptation.
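
To make "search in the latent space of behaviors" concrete, here is a minimal sketch. It is not the paper's algorithm: it swaps in a generic cross-entropy-style search, and every name in it (`latent_search`, `evaluate_return`, `toy_return`, `z_star`) is hypothetical. `evaluate_return(z)` stands in for rolling out the frozen BFM policy conditioned on latent `z` for one episode and returning the task return; here a toy quadratic replaces the real rollout so the snippet runs on its own.

```python
import numpy as np


def latent_search(evaluate_return, z_init, n_iters=10, pop_size=8,
                  n_elite=2, sigma=0.3, seed=0):
    """Cross-entropy-style search over the latent z (illustrative only).

    evaluate_return: hypothetical callable mapping a latent vector to an
        episode return. In a real setup it would roll out the frozen BFM
        policy conditioned on z.
    z_init: starting latent, e.g. the zero-shot latent for the task.
    """
    rng = np.random.default_rng(seed)
    mean = np.asarray(z_init, dtype=float)
    # Track the best latent seen so far, starting from the initial one.
    best_z, best_ret = mean.copy(), evaluate_return(mean)
    for _ in range(n_iters):
        # Sample candidate latents around the current search mean.
        candidates = mean + sigma * rng.standard_normal((pop_size, mean.size))
        returns = np.array([evaluate_return(z) for z in candidates])
        # Move the mean to the average of the highest-return candidates.
        elite_idx = np.argsort(returns)[-n_elite:]
        mean = candidates[elite_idx].mean(axis=0)
        top = elite_idx[-1]
        if returns[top] > best_ret:
            best_ret = float(returns[top])
            best_z = candidates[top].copy()
    return best_z, best_ret


# Toy stand-in for an environment rollout (an assumption, not a real task):
# the return peaks at a hidden latent z_star.
z_star = np.array([1.0, -0.5, 0.25, 0.0])


def toy_return(z):
    return -float(np.sum((z - z_star) ** 2))


z_zero_shot = z_star + 0.8  # pretend the zero-shot latent is somewhat off
ret_zero_shot = toy_return(z_zero_shot)
z_best, ret_best = latent_search(toy_return, z_zero_shot)
print("zero-shot return:", ret_zero_shot, "after search:", ret_best)
```

With the default settings, this sketch uses 1 + 10 × 8 = 81 evaluations, which is in the "tens of episodes" regime if each evaluation is one episode. Only the latent is searched; the pretrained networks stay untouched.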

7. This work was done during my internship at FAIR with wonderful collaborators A. Tirinzoni, A. Touati, @YingchenX, A. Kanervisto, @scottniekum, and @yayitsamyzhang, spearheaded by @alelazaric and @teopir. Paper (at RLC 2025):

This is great looking work Harshit! Congratulations! I'm looking forward to catching up @RL_Conference later this summer!

@RL_Conference Looking forward to chatting!

