We developed a simple, sample-efficient online RL technique for post-training image generation models. We see it as a possible steerable alternative to CFG, driven by any scalar reward, including human preference.

David McAllister

1,104 subscribers

66,098 views • 3 months ago • via X (Twitter)

Related videos

I was frustrated by the image generation experiences in existing apps. Including ours. That's why I spent almost a year playing with alternative UX. Today, we finally have something to show for it. Introducing T3 Canvas, the world's best image generation experience.

Theo - t3.gg

305,667 views • 5 months ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

435,870 views • 4 months ago

Introducing Modality Forcing, a recipe for post-training T2I models for SOTA RGB-Depth generation! Text-to-image (T2I) models learn rich representations of the spatial world. How do we build on this prior for high-quality depth generation? 🧵 [1/6]

Bardienus Duisterhof

64,543 views • 1 month ago

Google presents VLOGGER: Multimodal Diffusion for Embodied Avatar Synthesis. We propose VLOGGER, a method for audio-driven human video generation from a single input image of a person, which builds on the success of recent generative diffusion models. Our method consists of

AK

66,375 views • 2 years ago

So we did a bunch of projects with real-world reinforcement learning, but it was often too inefficient to be practical to train tabula rasa. This suggests we need better priors, but acquiring these from on-robot data can often be expensive as well. In our recent work, we show that despite being fundamentally inaccurate, simulation can provide a cheap way to guide real-world RL finetuning to be super efficient! We propose Simulation-Guided Fine-Tuning (SGFT), a simple paradigm for sim2real finetuning that uses simulation to provide reward shaping that accelerates real-world RL finetuning *beyond* just providing an initialization. TLDR: Use value functions from sim to shape rewards for real-world RL, see large sample efficiency improvements 🧵(1/6)

Abhishek Gupta

13,637 views • 1 year ago
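The SGFT post above says value functions from simulation are used to shape real-world rewards. One standard way to shape a reward with a value function is potential-based shaping, r' = r + γ·V(s') − V(s). The sketch below illustrates that general idea only. Whether SGFT uses exactly this form is an assumption, and `sim_value`, the 1-D state, and the goal at 1.0 are hypothetical stand-ins, not taken from the paper.

```python
from typing import Callable

State = float  # hypothetical 1-D state, for illustration only


def shaped_reward(
    reward: float,
    state: State,
    next_state: State,
    value_fn: Callable[[State], float],
    gamma: float = 0.99,
    done: bool = False,
) -> float:
    """Potential-based shaping: r + gamma * V(s') - V(s).

    The value of a terminal next state is treated as zero.
    """
    next_value = 0.0 if done else value_fn(next_state)
    return reward + gamma * next_value - value_fn(state)


def sim_value(state: State) -> float:
    # Hypothetical stand-in for a value function learned in simulation:
    # the value rises as the state approaches a goal located at 1.0.
    return -abs(1.0 - state)


# A step toward the goal earns a positive bonus even when the task reward is 0.
toward = shaped_reward(0.0, state=0.0, next_state=0.5, value_fn=sim_value, gamma=1.0)
# A step away from the goal is penalized by the same amount.
away = shaped_reward(0.0, state=0.5, next_state=0.0, value_fn=sim_value, gamma=1.0)
print(toward, away)
```

The appeal of this form is that the bonus depends only on the change in estimated value, so an imperfect simulated value function can still give a dense learning signal on the real system.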

Let's reverse engineer Disney's adorable, lifelike robot! I couldn't find a whitepaper, but this is how I think it's trained:
1. The emotional behaviors are curated by Disney animation artists, keyframe by keyframe. But it cannot be "rendered" directly on the robot because it doesn't take into account the complex real-world physics.
2. Reinforcement learning (RL) is a great tool for training low-level robot controllers. RL needs a reward function to optimize, and it's typically a task reward (e.g. walk in a straight line as fast as possible). The problem is that RL doesn't know what counts as "natural behavior", and often produces weird-looking body postures that somehow still maximize the reward. This is a human alignment problem just like ChatGPT.
3. Enter Adversarial Motion Prior (AMP): a technique that learns the human preference by training a classifier on what we consider "emotional & cute". In GAN literature, this is called a discriminator. Disney artists are good at creating such a dataset. You can then add AMP as an auxiliary reward in simulation to nudge the robot towards desired behaviors. AMP was developed by Peng et al. 2021 and Escontrela et al. 2022.
4. Add lots of data augmentation to make the controller robust to physical disturbances. In RL, it's called "domain randomization". This is a very powerful technique that bridges the gap between simulator and reality. Previously, OpenAI used domain randomization to train a 5-finger robot hand to manipulate a Rubik's Cube:
IEEE news article gave hints about the pipeline:
Finally, praying for world peace 🙏. I hope robotics like this will bring more joy to the world.

Jim Fan

314,654 views • 2 years ago

Wouldn't it be great if we could train robots without any teleoperation! In our latest paper, we train robots to mimic a human video of the task by simply matching the object features using RL. We only need one video and under an hour of robot training.

Lerrel Pinto

46,221 views • 1 year ago

Presenting DemoDiffusion: An extremely simple approach enabling a pre-trained 'generalist' diffusion policy to follow a human-demonstration for a novel task during inference One-shot human imitation *without* requiring any paired human-robot data or online RL 🙂 1/n

Homanga Bharadhwaj

32,830 views • 1 year ago

RobotMDM, by Disney Research, combines diffusion-based motion generation with RL to produce physics-aware humanoid motions from text prompts. Trained on human motion data with a reward surrogate for physical feasibility, it ensures realistic motions.

The Humanoid Hub

22,943 views • 1 year ago

Introducing ClickDiffusion! We developed a system for precise image manipulation and generation that combines natural language instructions with visual feedback provided by the user through a direct manipulation interface.

Alec Helbling

36,296 views • 2 years ago

Everyone knows action chunking is great for imitation learning. It turns out that we can extend its success to RL to better leverage prior data for improved exploration and online sample efficiency! The recipe to achieve this is incredibly simple. 🧵 1/N

Qiyang (Colin) Li

48,231 views • 1 year ago

MotionGPT: Human Motion as a Foreign Language paper page: Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.

AK

125,319 views • 3 years ago

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll: ✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning. ✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method. ✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities. Learn more and enroll for free:

DeepLearning.AI

16,771 views • 1 year ago

You'd think the race to AGI would mean training the biggest possible model. But parameter scaling had stalled for a long time after GPT-4's trillion+ parameters, and only now are models getting bigger again. What gives? Partially it’s RL scaling, as Dylan Patel explains. A 5T parameter model takes 5x longer to generate RL rollouts than a 1T model. Even if the bigger model is 2x more sample-efficient, the smaller model finishes RL faster, gets deployed to research sooner, and starts helping build the next model before the big one is even done training.

Dwarkesh Patel

65,123 views • 3 months ago
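The rollout arithmetic in the post above can be checked in a few lines. This is a rough sketch under a simplifying assumption that is not stated in the post: RL wall-clock time is taken to be (number of rollouts needed) × (time per rollout), ignoring gradient-update cost and any difference in parallelism. The function name is hypothetical.

```python
def relative_rl_time(rollout_slowdown: float, sample_efficiency_gain: float) -> float:
    """Wall-clock RL time of the bigger model relative to the smaller one.

    rollout_slowdown: how many times longer each rollout takes (5x in the post).
    sample_efficiency_gain: how many times fewer rollouts are needed (2x in the post).
    """
    return rollout_slowdown / sample_efficiency_gain


ratio = relative_rl_time(rollout_slowdown=5.0, sample_efficiency_gain=2.0)
print(ratio)  # 2.5
```

Under these assumptions the 5T model still needs 2.5 times as long as the 1T model to finish RL, which is the gap the post points to.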

New Course: Post-training of LLMs
Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning. Post-training transforms a general-purpose token predictor, trained on trillions of unlabeled text tokens, into an assistant that follows instructions and performs specific tasks. Because post-training is much cheaper than pre-training, it is practical for many more teams to incorporate it into their workflows.
In this course, you’ll learn three common post-training methods (Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL)) and how to use each one effectively. With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output, receives a reward score based on human or automated feedback, and is updated to improve performance.
You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.
In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing contrastive loss, penalizing poor responses and reinforcing preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.
Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today. Please sign up here:

Andrew Ng

125,146 views • 1 year ago
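The course announcement above describes DPO as training on a chosen and a rejected response with a contrastive loss. As a minimal sketch, here is the per-pair DPO loss in its commonly published form, -log σ(β·[(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]). This is not code from the course. The function name, argument names, and the example log-probabilities are illustrative assumptions, and real pipelines compute these log-probabilities by summing token log-probs from a policy model and a frozen reference model.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * (chosen_logratio - rejected_logratio)))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) equals softplus(-m) = log(1 + exp(-m)); the form below
    # avoids overflow when the margin is large in magnitude.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Hypothetical sequence log-probabilities: the policy has moved toward the
# chosen response and away from the rejected one, relative to the reference.
loss = dpo_loss(
    policy_chosen_logp=-1.0,
    policy_rejected_logp=-3.0,
    ref_chosen_logp=-2.0,
    ref_rejected_logp=-2.0,
)
print(loss)
```

When the policy matches the reference the loss is log 2 (about 0.693). It drops below that as the policy favors the chosen response more than the reference does, and rises above it when the policy favors the rejected one.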

We present MusicGen: A simple and controllable music generation model. MusicGen can be prompted by both text and melody. We release code (MIT) and models (CC-BY NC) for open research, reproducibility, and for the music community:

Felix Kreuk

697,308 views • 3 years ago

Polymath is training world generation models to automate the creation of RL environments. Traditionally, RL environment generation has been bottlenecked by human data. Superintelligence will never be achieved by human data alone. Polymath is building the core technology to enable automated environment generation using far less human effort than traditionally required, and eventually none. This allows for more complex and realistic worlds, and higher quality, scale, and diversity of tasks. This will be essential to unlock RL scaling. The end goal is to create large-scale, long-horizon environments from a text description alone. This will enable the creation of worlds of arbitrary complexity and scale, which is foundational for training & evaluating autonomous, superintelligent AI agents. Congrats on the launch, Dylan Ma and Naren Yenuganti!

Y Combinator

44,192 views • 4 months ago

Champ: Controllable and Consistent Human Image Animation with 3D Parametric Guidance. In this study, we introduce a methodology for human image animation by leveraging a 3D human parametric model within a latent diffusion framework to enhance shape alignment and motion

AK

194,356 views • 2 years ago

We discovered an emergent property of VLAs like π0/π0.5/π0.6: as we scale up pre-training, the model learns to align human videos and robot data! This gives us a simple way to leverage human videos. Once π0.5 knows how to control robots, it can naturally learn from human video.

Physical Intelligence

1,183,087 views • 7 months ago

Meta announces Movie Gen: A Cast of Media Foundation Models. We present Movie Gen, a cast of foundation models that generates high-quality, 1080p HD videos with different aspect ratios and synchronized audio. We also show additional capabilities such as precise instruction-based video editing and generation of personalized videos based on a user’s image. Our models set a new state-of-the-art on multiple tasks: text-to-video synthesis, video personalization, video editing, video-to-audio generation, and text-to-audio generation. Our largest video generation model is a 30B parameter transformer trained with a maximum context length of 73K video tokens, corresponding to a generated video of 16 seconds at 16 frames-per-second. We show multiple technical innovations and simplifications on the architecture, latent spaces, training objectives and recipes, data curation, evaluation protocols, parallelization techniques, and inference optimizations that allow us to reap the benefits of scaling pre-training data, model size, and training compute for training large scale media generation models. We hope this paper helps the research community to accelerate progress and innovation in media generation models.

AK

62,719 views • 1 year ago