We developed a simple, sample-efficient online RL technique for post-training image generation models. We see it as a possible steerable alternative to CFG, driven by any scalar reward, including human preference.

David McAllister

1,020 subscribers

63,255 views • 1 month ago • via X (Twitter)
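
For context on the CFG comparison in the post above: classifier-free guidance (CFG) steers sampling by extrapolating from an unconditional noise prediction toward a conditional one. The sketch below shows only that standard combination step. The function name `cfg_combine` and the toy list inputs are illustrative assumptions, and it does not represent the RL technique the post describes.

```python
def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Standard classifier-free guidance combination (illustrative sketch).

    eps_uncond / eps_cond: noise predictions without / with the prompt,
    given here as plain lists of floats for simplicity.
    guidance_scale: 0 ignores the prompt, 1 is the plain conditional
    prediction, larger values push further toward the condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Toy values, not real model outputs.
    print(cfg_combine([0.0, 1.0], [1.0, 1.0], guidance_scale=2.0))
```

The guidance scale is a single hand-set knob applied at sampling time, which is the kind of steering the post proposes to replace with a learned, reward-driven one.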

Related Videos

I was frustrated by the image generation experiences in existing apps. Including ours. That's why I spent almost a year playing with alternative UX. Today, we finally have something to show for it. Introducing T3 Canvas, the world's best image generation experience.

Theo - t3.gg

305,667 views • 3 months ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

428,531 views • 2 months ago

So we did a bunch of projects with real-world reinforcement learning, but it was often too inefficient to be practical to train tabula rasa. This suggests we need better priors, but acquiring these from on-robot data can often be expensive as well. In our recent work, we show that despite being fundamentally inaccurate, simulation can provide a cheap way to guide real-world RL finetuning to be super efficient! We propose Simulation-Guided Fine-Tuning (SGFT), a simple paradigm for sim2real finetuning that uses simulation to provide reward shaping that accelerates real-world RL finetuning *beyond* just providing an initialization. TLDR: Use value functions from sim to shape rewards for real-world RL, see large sample efficiency improvements 🧵(1/6)

Abhishek Gupta

13,591 views • 1 year ago
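
The SGFT post above summarizes its idea as using value functions from simulation to shape rewards for real-world RL. One standard way to do this is potential-based reward shaping, sketched below. The post does not state that SGFT uses exactly this form, so the formula, the function name `shaped_reward`, and the zero-potential convention at terminal states are all assumptions made for illustration.

```python
def shaped_reward(reward, v_sim_state, v_sim_next_state, gamma=0.99, done=False):
    """Potential-based reward shaping with a simulation value function.

    reward:            reward observed in the real world for this transition
    v_sim_state:       value of the current state, estimated in simulation
    v_sim_next_state:  value of the next state, estimated in simulation
    gamma:             discount factor
    done:              if True, the next state is terminal and its potential
                       is treated as zero (an assumed convention)
    """
    next_potential = 0.0 if done else v_sim_next_state
    return reward + gamma * next_potential - v_sim_state


if __name__ == "__main__":
    # Toy numbers: moving to a state the simulator values more highly
    # earns a bonus on top of the sparse real-world reward.
    print(shaped_reward(reward=0.0, v_sim_state=1.0, v_sim_next_state=2.0, gamma=0.9))
```

With this form the simulator's value estimate acts as a dense hint about progress, while the real-world reward still defines the task.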

Let's reverse engineer Disney's adorable, lifelike robot! I couldn't find a whitepaper, but this is how I think it's trained:
1. The emotional behaviors are curated by Disney animation artists, keyframe by keyframe. But they cannot be "rendered" directly on the robot because they don't take into account the complex real-world physics.
2. Reinforcement learning (RL) is a great tool for training low-level robot controllers. RL needs a reward function to optimize, and it's typically a task reward (e.g. walk in a straight line as fast as possible). The problem is that RL doesn't know what counts as "natural behavior", and often produces weird-looking body postures that somehow still maximize the reward. This is a human alignment problem just like ChatGPT.
3. Enter Adversarial Motion Prior (AMP): a technique that learns the human preference by training a classifier on what we consider "emotional & cute". In GAN literature, this is called a discriminator. Disney artists are good at creating such a dataset. You can then add AMP as an auxiliary reward in simulation to nudge the robot towards desired behaviors. AMP was developed by Peng et al. 2021 and Escontrela et al. 2022.
4. Add lots of data augmentation to make the controller robust to physical disturbances. In RL, it's called "domain randomization". This is a very powerful technique that bridges the gap between simulator and reality. Previously, OpenAI used domain randomization to train a 5-finger robot hand to manipulate a Rubik's Cube:
IEEE news article gave hints about the pipeline:
Finally, praying for world peace 🙏. I hope robotics like this will bring more joy to the world.

Jim Fan

314,504 views • 2 years ago
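
Steps 3 and 4 of the post above can be sketched in a few lines: an auxiliary "style" reward from a discriminator is mixed with the task reward, and physics parameters are re-sampled per episode for domain randomization. This is a rough illustration, not Disney's pipeline. The weights, parameter names, and ranges are hypothetical, and the style score is assumed to be a number in [0, 1] already produced by some discriminator.

```python
import random


def total_reward(task_reward, style_reward, w_task=0.5, w_style=0.5):
    """Mix the task reward with an auxiliary style reward (weights are assumed)."""
    return w_task * task_reward + w_style * style_reward


def sample_physics(rng, friction_range=(0.5, 1.5), mass_scale_range=(0.8, 1.2),
                   max_push_force=50.0):
    """Domain randomization: draw new simulator parameters for one episode.

    All parameter names and ranges here are hypothetical placeholders.
    """
    return {
        "friction": rng.uniform(*friction_range),
        "mass_scale": rng.uniform(*mass_scale_range),
        "push_force": rng.uniform(0.0, max_push_force),
    }


if __name__ == "__main__":
    rng = random.Random(0)
    for episode in range(3):
        physics = sample_physics(rng)  # would be applied to the simulator here
        print(episode, physics, total_reward(task_reward=1.0, style_reward=0.5))
```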

Wouldn't it be great if we could train robots without any teleoperation! In our latest paper, we train robots to mimic a human video of the task by simply matching the object features using RL. We only need one video and under an hour of robot training.

Lerrel Pinto

46,197 views • 1 year ago

Presenting DemoDiffusion: An extremely simple approach enabling a pre-trained 'generalist' diffusion policy to follow a human demonstration for a novel task during inference. One-shot human imitation *without* requiring any paired human-robot data or online RL 🙂 1/n

Homanga Bharadhwaj

32,830 views • 11 months ago

Google DeepMind, David Silver reveals: we built a system that used RL to discover its own RL algorithms. this AI-designed system outperformed all human-created RL algorithms developed over the years.

Haider.

397,103 views • 1 year ago

Introducing ClickDiffusion! We developed a system for precise image manipulation and generation that combines natural language instructions with visual feedback provided by the user through a direct manipulation interface.

Alec Helbling

36,293 views • 2 years ago

Everyone knows action chunking is great for imitation learning. It turns out that we can extend its success to RL to better leverage prior data for improved exploration and online sample efficiency! The recipe to achieve this is incredibly simple. 🧵 1/N

Qiyang (Colin) Li

48,231 views • 11 months ago

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll: ✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning. ✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method. ✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities. Learn more and enroll for free:

DeepLearning.AI

16,746 views • 10 months ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,311 views • 2 years ago
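
The MotionGPT abstract above mentions discrete vector quantization that turns motion into "motion tokens". The lookup step of vector quantization can be sketched as a nearest-neighbor search in a codebook. This shows only that step under simplifying assumptions: the codebook and frames are toy 2-D lists, whereas the paper's model works on learned features, and the function name `quantize` is illustrative.

```python
def quantize(frames, codebook):
    """Map each motion feature vector to the index of its nearest codebook entry.

    frames:   list of feature vectors (one per motion frame or segment)
    codebook: list of code vectors; the returned indices act as motion tokens
    """
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code vector.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


if __name__ == "__main__":
    toy_codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
    toy_frames = [[0.1, 0.0], [0.9, 1.2], [0.1, 0.8]]
    print(quantize(toy_frames, toy_codebook))
```

Once motion is a sequence of integer tokens, it can be placed in the same vocabulary as text tokens, which is what lets a language model treat it as another language.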

You'd think the race to AGI would mean training the biggest possible model. But parameter scaling had stalled for a long time after GPT-4's trillion+ parameters, and only now are models getting bigger again. What gives? Partially it’s RL scaling, as Dylan Patel explains. A 5T parameter model takes 5x longer to generate RL rollouts than a 1T model. Even if the bigger model is 2x more sample-efficient, the smaller model finishes RL faster, gets deployed to research sooner, and starts helping build the next model before the big one is even done training.

Dwarkesh Patel

65,123 views • 2 months ago
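
The arithmetic in the post above can be checked directly. Assume, as a simplification not stated in the post, that total RL time is proportional to the number of rollouts needed times the time per rollout, ignoring parallelism and gradient-update cost. Then a model that is 5x slower per rollout but 2x more sample-efficient still needs 5 / 2 = 2.5x the wall-clock time.

```python
def relative_rl_time(rollout_slowdown, sample_efficiency_gain):
    """Wall-clock RL time of the big model relative to the small one.

    rollout_slowdown:       how many times longer one rollout takes
    sample_efficiency_gain: how many times fewer rollouts are needed
    Assumes time = (number of rollouts) x (time per rollout).
    """
    return rollout_slowdown / sample_efficiency_gain


if __name__ == "__main__":
    # Figures from the post: 5T vs 1T parameters, 2x sample efficiency.
    print(relative_rl_time(5, 2))  # 2.5
```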

(1/2) LightIt: Illumination Modeling and Control for Diffusion Models! #CVPR2024 We facilitate lighting control for novel image generation from text prompts. We can also edit lighting for a given input image. Video: Project:

Matthias Niessner

19,835 views • 2 years ago

New Course: Post-training of LLMs
Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning.
Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.
In this course, you’ll learn three common post-training methods, Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL), and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.
You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.
In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes poor responses and reinforces preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.
Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today.
Please sign up here:

Andrew Ng

125,146 views • 11 months ago
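
The course description above characterizes DPO as minimizing a contrastive loss over a chosen and a rejected response. A minimal sketch of the standard DPO loss for a single preference pair follows. It is not taken from the course materials: the function name `dpo_loss`, the default `beta`, and the scalar log-probability inputs are illustrative assumptions, and a real pipeline would compute these log-probabilities from a policy model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given sequence log-probabilities.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # Numerically stable -log(sigmoid(x)) = log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # When the policy equals the reference, the loss is log(2).
    print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
    # Raising the chosen response's log-probability lowers the loss.
    print(dpo_loss(-0.5, -2.0, -1.0, -2.0))
```

The loss falls when the policy raises the chosen response's likelihood relative to the reference and rises when it favors the rejected one, which matches the "penalize poor, reinforce preferred" description.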

We present MusicGen: A simple and controllable music generation model. MusicGen can be prompted by both text and melody. We release code (MIT) and models (CC-BY NC) for open research, reproducibility, and for the music community:

Felix Kreuk

697,276 views • 3 years ago

We develop a method to test global opinions represented in language models. We find the opinions represented by the models are most similar to those of the participants in USA, Canada, and some European countries. We also show the responses are steerable in separate experiments.

Anthropic

261,392 views • 2 years ago

.Polymath is training world generation models to automate the creation of RL environments. Traditionally, RL environment generation has been bottlenecked by human data. Superintelligence will never be achieved by human data alone. Polymath is building the core technology to enable automated environment generation using far less human effort than traditionally required, and eventually none. This allows for more complex and realistic worlds, and higher quality, scale, and diversity of tasks. This will be essential to unlock RL scaling. The end goal is to create large-scale, long-horizon environments from a text description alone. This will enable the creation of worlds of arbitrary complexity and scale, which is foundational for training & evaluating autonomous, superintelligent AI agents. Congrats on the launch, Dylan Ma and Naren Yenuganti!

Y Combinator

44,192 views • 3 months ago

A conversation on the optimal reward for coding agents, infinite context models, and real-time RL

Cursor

317,672 views • 1 year ago

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago