🏗️ Policy Adaptation from Foundation Model Feedback #CVPR2023

Instead of using a foundation model as a pre-trained encoder (generator), we use it as a teacher (discriminator) that tells our policy where it went wrong and helps it adapt to new environments and tasks.

24,396 views • 3 years ago • via X (Twitter)

5 Comments

Xiaolong Wang • 3 years ago

The work was led by Yuying @tttoaster_ when she was interning in our lab at UCSD, collaborating with @anna_macalus on working with the robots. Please check arXiv: Full video introduction:

Ted Xiao • 3 years ago

Nice work! Using VLMs unlocks automated feedback, which is often quite expensive for humans to produce. For another style of this same method (language feedback from CLIP) but in the offline setting, check out our work DIAL:

Xiaolong Wang • 3 years ago

Thanks for sharing. Yes, definitely quite relevant! Applying this to the offline setting is a very interesting idea. I think the shared core idea is that we are all using a VLM to provide some sort of reward signal.

Rishabh Agarwal • 3 years ago

Nice, looks very relevant to Reincarnating RL:

Xiaolong Wang • 3 years ago

Thank you for making the connection. Yes, I think our work has a bit of the flavor of progressive learning. The focus has been on using VLMs for supervision, but extending it further in the progressive learning direction can actually be quite an interesting ... hmm

Related Videos

New Course: Post-training of LLMs

Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.

Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.

Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.

In this course, you’ll learn three common post-training methods and how to use each one effectively: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL). With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.

You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.

In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes poor responses and reinforces preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.

Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.

Please sign up here:

Andrew Ng

125,146 views • 11 months ago