Behavioral Foundation Models (BFMs) trained with RL are secretly more powerful than we think. BFMs directly output a policy believed to be near-optimal given any reward function. Our new work shows that they can actually do much better:

Harshit Sikchi

2,088 subscribers

44,182 views • 1 year ago • via X (Twitter)


8 comments

Harshit Sikchi • 1 year ago

2. BFMs learn generalizable representations that allow an embodied agent to act near-optimally for any reward function by providing a mapping from the reward to the corresponding near-optimal policy. They do this with unsupervised pretraining algorithms such as PSM, FB, and HILP.
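A minimal sketch of what such a reward-to-policy mapping can look like, assuming an FB-style rule in which the task latent is the reward-weighted average of a backward embedding, z ≈ E[r(s) B(s)]. The names `infer_task_latent` and `toy_backward_embed` are hypothetical, and the toy embedding stands in for a learned network; this is an illustration, not the implementation used in the work.

```python
import random


def infer_task_latent(backward_embed, states, rewards):
    """Estimate z as the reward-weighted average of embeddings B(s)."""
    dim = len(backward_embed(states[0]))
    z = [0.0] * dim
    for state, reward in zip(states, rewards):
        b = backward_embed(state)
        for i in range(dim):
            z[i] += reward * b[i] / len(states)
    return z


def toy_backward_embed(state):
    # Hypothetical stand-in for a learned network B(s): the identity on 2-D states.
    x, y = state
    return [x, y]


rng = random.Random(0)
# States sampled from the (assumed) pretraining distribution.
states = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(1000)]
# Toy reward labelled on those states: prefer large x.
rewards = [s[0] for s in states]

z = infer_task_latent(toy_backward_embed, states, rewards)
print(z)  # the first coordinate dominates, matching the reward direction
```

The resulting `z` would then be passed to a latent-conditioned policy, which is what makes the inference zero-shot: no gradient steps are taken for the new reward.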

Harshit Sikchi • 1 year ago

3. However, unlike language and vision, RL has mostly operated in a tabula rasa fashion. We don’t have RL pretraining methods that can be fine-tuned rapidly for any task. Most RL methods unlearn when we start fine-tuning, an issue often attributed to miscalibration of the value function.

Harshit Sikchi • 1 year ago

4. Our first finding is striking: In the space of learned behaviors, the unsupervised RL pretraining based on successor features discovers behaviors that are much better than the ones that are output zero-shot. Below is a thorough evaluation on a number of environments and tasks:

Harshit Sikchi • 1 year ago

5. Based on these findings, we present ways to rapidly fine-tune the zero-shot policy output by BFMs to improve performance on any downstream task. The algorithms are general, simple, task-agnostic, and performant. The key idea is: search in the latent space of behaviors.
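One way to picture search in the latent space of behaviors is the sketch below. It assumes a frozen, latent-conditioned policy whose episode return can be measured for any latent z, and it uses a plain random-search hill climber started from the zero-shot latent. That optimizer is swapped in purely for illustration and is not the algorithm from the paper; `search_latent` and `toy_episode_return` are hypothetical names, and the toy return replaces real environment rollouts.

```python
import random


def search_latent(z_init, episode_return, n_iters=5, pop_size=8, sigma=0.3, seed=0):
    """Hill-climb over z: keep a perturbed latent only if its return is higher."""
    rng = random.Random(seed)
    best_z = list(z_init)
    best_ret = episode_return(best_z)
    for _ in range(n_iters):
        for _ in range(pop_size):
            # Perturb the current best latent with Gaussian noise.
            cand = [zi + rng.gauss(0.0, sigma) for zi in best_z]
            ret = episode_return(cand)  # one rollout of the frozen policy
            if ret > best_ret:
                best_z, best_ret = cand, ret
    return best_z, best_ret


def toy_episode_return(z):
    # Hypothetical stand-in for a rollout: return peaks at latent (1.0, -0.5).
    target = (1.0, -0.5)
    return -sum((zi - ti) ** 2 for zi, ti in zip(z, target))


z_zero_shot = [0.4, 0.1]  # assumed latent produced by zero-shot inference
zero_shot_return = toy_episode_return(z_zero_shot)
z_adapted, adapted_return = search_latent(z_zero_shot, toy_episode_return)
print(zero_shot_return, adapted_return)
```

With the defaults this costs 41 return evaluations, and only the low-dimensional latent changes while the pretrained networks stay frozen.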

Harshit Sikchi • 1 year ago

6. Our proposed algorithms can adapt in tens of episodes to achieve much better behaviors. Below is an example of search in the latent space of behaviors, showing how the policy evolves during adaptation.

Harshit Sikchi • 1 year ago

7. This work was done during my internship at FAIR with wonderful collaborators A. Tirinzoni, A. Touati, @YingchenX, A. Kanervisto, @scottniekum, and @yayitsamyzhang, spearheaded by @alelazaric and @teopir. Paper (at RLC 2025):

Taylor W. Killian • 1 year ago

This is great-looking work, Harshit! Congratulations! I'm looking forward to catching up at @RL_Conference later this summer!

Harshit Sikchi • 1 year ago

@RL_Conference Looking forward to chatting!
