1/ Au revoir, RLVR. New work: EBFT (Energy-Based Fine-Tuning), a post-training method that directly optimizes the long-horizon behavior of model generations, addressing SFT’s deployment-time error amplification without relying on sparse, task-specific rewards.

Sham Kakade

18,795 subscribers

266,585 views • 3 months ago • via X (Twitter)

Science & Technology

Related videos

LLM post-training used to mean fine-tuning to a downstream task. Robotics has been stuck in this setting, needing task-specific fine-tuning for best performance. π07 changes this: it works out of the box & outperforms fine-tuned specialists. Details:

Chelsea Finn

61,139 views • 2 months ago

Coming May 14 at Microsoft Research Forum: a new release and demo from MSR AI Frontiers. Plus new work on Agentic GitHub Workflows, Real-time agent verification, Energy-based fine-tuning, and Guiding the AI transition. Register now:

Microsoft Research

17,746,951 views • 2 months ago

New short course on LLMOps! LLMOps (large language model operations) is a rapidly developing field that takes ideas from MLOps (machine learning operations) and specializes them to building and deploying LLM-based applications. In this course, taught by Google Cloud's Erwin Huizenga, you'll learn to use automation to make building, tuning and deploying an LLM-based application less manual and more efficient.
You'll learn how to:
- Apply supervised fine-tuning to tune an LLM to a specific task
- Automate and orchestrate LLM-tuning and deployment by customizing a pre-built tuning pipeline
- Apply best practices for preparing training data for supervised fine-tuning of an LLM
- Create an LLMOps workflow you can adapt to other LLM-tuning jobs
This course doesn't assume any prior MLOps or LLMOps experience. Sign up here to learn about this emerging field!

Andrew Ng

221,760 views • 2 years ago

Our course recommendation of the day is “Post-training of LLMs,” where you’ll learn how to customize pre-trained language models using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL). You'll learn when to use each method, how to curate training data, and how to implement them in code to shape model behavior effectively. Enroll here:

DeepLearning.AI

29,369 views • 8 months ago

An exciting new course: Fine-tuning and Reinforcement Learning for LLMs: Intro to Post-training, taught by Sharon Zhou, VP of AI at AMD. Available now at
Post-training is the key technique used by frontier labs to turn a base LLM (a model trained on massive unlabeled text to predict the next word/token) into a helpful, reliable assistant that can follow instructions. I've also seen many applications where post-training is what turns a demo application that works only 80% of the time into a reliable system that consistently performs. This course will teach you the most important post-training techniques!
In this 5-module course, Sharon walks you through the complete post-training pipeline: supervised fine-tuning, reward modeling, RLHF, and techniques like PPO and GRPO. You'll also learn to use LoRA for efficient training, and to design evals that catch problems before and after deployment.
Skills you'll gain:
- Apply supervised fine-tuning and reinforcement learning (RLHF, PPO, GRPO) to align models to desired behaviors
- Use LoRA for efficient fine-tuning without retraining entire models
- Prepare datasets and generate synthetic data for post-training
- Understand how to operate LLM production pipelines, with go/no-go decision points and feedback loops
These advanced methods aren’t limited to frontier AI labs anymore, and you can now use them in your own applications. Learn here:

Andrew Ng

132,304 views • 8 months ago

StyleDrop: Text-to-Image Generation in Any Style. The paper introduces StyleDrop, a method that enables the synthesis of images that faithfully follow a specific style using a text-to-image model. The proposed method is extremely versatile and captures nuances and details of a user-provided style, such as color schemes, shading, design patterns, and local and global effects. It efficiently learns a new style by fine-tuning very few trainable parameters (less than 1% of total model parameters) and improving the quality via iterative training with either human or automated feedback. Better yet, StyleDrop is able to deliver impressive results even when the user supplies only a single image that specifies the desired style. An extensive study shows that, for the task of style tuning text-to-image models, StyleDrop implemented on Muse convincingly outperforms other methods, including DreamBooth and textual inversion on Imagen or Stable Diffusion. paper page:

AK

56,372 views • 3 years ago

New Course: Post-training of LLMs. Learn to post-train and customize an LLM in this short course, taught by Banghua Zhu, Assistant Professor at the University of Washington and co-founder of @NexusflowX.
Training an LLM to follow instructions or answer questions has two key stages: pre-training and post-training. In pre-training, it learns to predict the next word or token from large amounts of unlabeled text. In post-training, it learns useful behaviors such as following instructions, tool use, and reasoning. Post-training transforms a general-purpose token predictor (trained on trillions of unlabeled text tokens) into an assistant that follows instructions and performs specific tasks. Because it is much cheaper than pre-training, it is practical for many more teams to incorporate post-training methods into their workflows than pre-training.
In this course, you’ll learn how to use three common post-training methods effectively: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL). With SFT, you train the model on pairs of input and ideal output responses. With DPO, you provide both a preferred (chosen) and a less preferred (rejected) response and train the model to favor the preferred output. With RL, the model generates an output and receives a reward score based on human or automated feedback, and the model is then updated to improve performance.
You’ll learn the basic concepts, common use cases, and principles for curating high-quality data for effective training. Through hands-on labs, you’ll download a pre-trained model from Hugging Face and post-train it using SFT, DPO, and RL to see how each technique shapes model behavior.
In detail, you’ll:
- Understand what post-training is, when to use it, and how it differs from pre-training.
- Build an SFT pipeline to turn a base model into an instruct model.
- Explore how DPO reshapes behavior by minimizing a contrastive loss that penalizes poor responses and reinforces preferred ones.
- Implement a DPO pipeline to change the identity of a chat assistant.
- Learn online RL methods such as Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO), and how to design reward functions.
- Train a model with GRPO to improve its math capabilities using a verifiable reward.
Post-training is one of the most rapidly developing areas of LLM training. Whether you’re building a high-accuracy context-specific assistant, fine-tuning a model's tone, or improving task-specific accuracy, this course will give you experience with the most important techniques shaping how LLMs are post-trained today. Please sign up here:

Andrew Ng

125,146 views • 11 months ago

✨ 🖼️ Generate images with consistent characters without any fine-tuning or training. Consistory is a training-free approach to maintaining subject consistency between text-to-image generations on pretrained models. #NVIDIAResearch 🎨 Test it here:

NVIDIA AI Developer

11,600 views • 1 year ago

Excited to share our latest progress on DYNA-1 pre-training! 🤖 The base model now can perform diverse, dexterous tasks (laundry folding, package sorting, …) without any post-training, even in unseen environments. This powerful base also allows extremely efficient fine-tuning to ~100% success on challenging new tasks with as little as 1 hour of data! 🤯 Watch it master two of them: cup stacking & celery chopping on repeat, no failures. 👇

Dyna Robotics

66,619 views • 7 months ago

A road repair method that recycles old asphalt directly on-site, creating a new surface without hauling materials.

Science girl

283,448 views • 7 months ago

QVAC SDK 0.9.0 (to be released in ~10 days) will support LoRA fine-tuning directly on-device, letting developers customize LLMs with their own data without sending anything to the cloud. You just load a base model, point it at your training dataset, and get a lightweight LoRA adapter back, all running locally. The fine-tuned model can then be used for inference immediately, with no extra setup.
Why it matters: LoRA (Low-Rank Adaptation) fine-tuning lets you specialize a general-purpose language model for your specific use case (like matching a brand's tone, mastering domain terminology, or following a particular output format) using a fraction of the compute a full fine-tune would require. QVAC handles the entire workflow locally: dataset preparation, training with configurable hyperparameters, checkpoint saving, and seamless inference with the resulting adapter. Your data never leaves the device.
The developer experience: Fine-tuning with QVAC is as simple as calling "sdk.finetune()" with your dataset and a few hyperparameters. Training runs entirely on your local hardware, produces a compact LoRA adapter file, and supports pause/resume so you can stop a job and pick it back up without losing progress. The result plugs straight into QVAC's inference pipeline: no model conversion, no deployment step, just immediate local completions with your fine-tuned model.

Paolo Ardoino 🤖

42,271 views • 2 months ago

In case you missed it, we recently launched "Post-training of LLMs," a short course where you'll:
✅ Understand when and why to use post-training methods like Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning.
✅ Learn the concepts underlying the three post-training methods of SFT, DPO, and Online RL, their common use-cases, and how to curate high-quality data to effectively train a model using each method.
✅ Download a pre-trained model and implement post-training pipelines to turn a base model into an instruct model, change the identity of a chat assistant, and improve a model’s math capabilities.
Learn more and enroll for free:

DeepLearning.AI

16,771 views • 11 months ago

Remember reinforcement fine-tuning? We’ve been working away at it since last December, and it’s available today with OpenAI o4-mini! RFT uses chain-of-thought reasoning and task-specific grading to improve model performance—especially useful for complex domains. Take Accordance, which used RFT to fine-tune a model that’s SOTA for their tax and accounting purposes. And in supervised fine-tuning news: you can now fine-tune GPT-4.1 nano. Get even more from our fastest, cheapest model by training it specifically for your use-case.

OpenAI Developers

663,651 views • 1 year ago

Today we previewed Reinforcement Fine-Tuning, a new model customization technique that enables organizations to build expert models for specific, complex tasks in domains such as coding, scientific research, or finance.

OpenAI

1,072,512 views • 1 year ago

We’re open-sourcing a new tool to control how LLMs behave: k-steering. In just 10 lines of code, you can control multiple aspects of LLM behavior at the same time without any fine-tuning or prompt engineering. Here's how 👇

Martian

12,067 views • 2 months ago

We pushed MolmoAct 2 into a completely out-of-distribution studio setting to test its robustness under extreme environmental changes. Despite only 10 minutes of task-specific fine-tuning data collected elsewhere, it adapted surprisingly well to the new environment.

Jiafei Duan

11,381 views • 1 month ago

We have news! We created a new robotics model called Loop Model 1. On the zip-tie insertion task, it achieves 20x more throughput per unit of data than "Pi06 + RLT" from Physical Intelligence, a top model for such tasks. It’s the missing piece that makes MicroFactory work, because now deployment becomes so simple and fast that our users can do it themselves.

Igor Kulakov (MicroFactory)

75,879 views • 1 month ago

New short course on Fine-tuning LLMs! Many developers are moving beyond only prompting, to also fine-tuning LLMs - that is, taking a pre-trained model and training it further on your own data, which can deliver superior results inexpensively. In this course, Sharon Zhou, CEO of Lamini (disclosure: I’m a minor shareholder), shows you how to recognize when fine-tuning can help, and how to train an open-source LLM on your own data. I hope you enjoy the course!

Andrew Ng

502,781 views • 2 years ago

Jump Training vs. Plyometrics If you dig into Yuri Verkhoshansky’s original work, he was very specific in how he framed the shock method, “sharp, compulsory muscular tension caused by the kinetic energy of a falling body.” Over time, as the term spread to the West, the

Fred Duncan

35,899 views • 9 months ago

Super excited to launch a new AI course! 🚀 Fine-Tuning & Reinforcement Learning for LLMs: Intro to Post-Training A collaboration between AMD 🤝 Andrew Ng’s DeepLearning.AI to give every developer the tools & compute to work with the same post-training techniques, used across today’s leading AI labs. 🎓 Learn for free → 🧵

Sharon Zhou

20,386 views • 8 months ago