Today we previewed Reinforcement Fine-Tuning, a new model customization technique that enables organizations to build expert models for specific, complex tasks in domains such as coding, scientific research, or finance.

OpenAI

5,018,930 subscribers

1,072,605 views • 1 year ago • via X (Twitter)

Education, News, Politics, Science, Technology

9 comments

OpenAI • 1 year ago

We’re expanding alpha access to researchers, universities, and enterprises through our Reinforcement Fine-Tuning Research Program. Spots are limited—apply now.

### • 1 year ago

Day 1 - $200 a month Day 2 - Something not actually available. Why does this expressly not feel like 12 Days of Christmas when that's what it was trying to bill itself as?

elvis • 1 year ago

Great stuff and exciting to see the use of RFT to tune more powerful custom domain models. TL;DR for who is interested:

Legs Benedict • 1 year ago

2 out of 12 days have been announcing things for organisations...

$Q*🍓on Ethereum • 1 year ago

Science models coming

Spencer Hakimian • 1 year ago

Going to be a game changer in portfolio backtesting for financial firms.

AK • 1 year ago

awesome, also try out chatgpt and much more here:

Pyters • 1 year ago

OpenAI has introduced Reinforcement Fine-Tuning (RFT), a new technique designed to enhance AI model performance in specialized domains like coding, scientific research, and finance.

Muratcan Koylan • 1 year ago

Releasing a method you use to fine-tune your frontier models is absolutely fantastic—hands down, a great initiative. Thank you! I’m excited about the opportunity to be part of this research program, hopefully.

Related videos

Remember reinforcement fine-tuning? We’ve been working away at it since last December, and it’s available today with OpenAI o4-mini! RFT uses chain-of-thought reasoning and task-specific grading to improve model performance—especially useful for complex domains. Take Accordance, which used RFT to fine-tune a model that’s SOTA for their tax and accounting purposes. And in supervised fine-tuning news: you can now fine-tune GPT-4.1 nano. Get even more from our fastest, cheapest model by training it specifically for your use-case.

OpenAI Developers

663,794 views • 1 year ago

New Course: Reinforcement Fine-Tuning LLMs with GRPO!

Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll:

- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:
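
The reward-function and loss ideas listed in the course description can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the course's code or Predibase's API: the function names (`wordle_reward`, `group_relative_advantages`, `clipped_term`, `kl_penalty`), the green/yellow scoring weights, the use of the population standard deviation, and the clipping value of 0.2 are all assumptions made for the example. It shows a programmable reward for a Wordle guess, advantages computed relative to a group of sampled responses, the clipped probability-ratio term, and one commonly used non-negative KL estimate.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def wordle_reward(guess: str, secret: str) -> float:
    """Hypothetical reward in [0, 1]: full credit for green letters, half for yellow."""
    guess, secret = guess.lower(), secret.lower()
    if len(guess) != len(secret) or not guess.isalpha():
        return 0.0  # malformed guesses earn nothing
    greens = sum(g == s for g, s in zip(guess, secret))
    # Secret letters not matched exactly; each can justify at most one yellow.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)
    yellows = 0
    for g, s in zip(guess, secret):
        if g != s and remaining[g] > 0:
            yellows += 1
            remaining[g] -= 1
    return (2 * greens + yellows) / (2 * len(secret))


def group_relative_advantages(rewards, eps=1e-6):
    """Score each response against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one token (to be maximized)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalty(logp_policy, logp_ref):
    """Non-negative per-token KL estimate from log-probs of the sampled token."""
    log_ratio = logp_ref - logp_policy
    return math.exp(log_ratio) - log_ratio - 1.0


if __name__ == "__main__":
    secret = "crane"
    guesses = ["crane", "nacre", "zzzzz", "abc"]  # one group of sampled responses
    rewards = [wordle_reward(g, secret) for g in guesses]
    advantages = group_relative_advantages(rewards)
    for g, r, a in zip(guesses, rewards, advantages):
        print(f"{g!r}: reward={r:.2f} advantage={a:+.2f}")
```

Because advantages are computed within the group, a response is rewarded only for being better than its siblings. If every sampled response earns the same reward, all advantages are zero and that group contributes no learning signal.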

Andrew Ng

86,457 views • 1 year ago

Super excited to share 🧠MLGym 🦾 – the first Gym environment for AI Research Agents 🤖🔬

We introduce MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. The key contributions of our work are:

- 🕹️ Enables the exploration of different training algorithms for AI Research Agents such as RL
- 🛠️ Provides a flexible evaluation framework that can accommodate different artifacts such as models, algorithms, or predictions
- 🤖 Allows researchers to evaluate any model without the need to develop a custom agentic harness
- 🎯 Introduces 13 diverse open-ended AI Research tasks for evaluating AI Research Agents on a wide range of domains such as computer vision, natural language processing, reinforcement learning, game theory, and logical reasoning.
- 📈 Proposes a new evaluation metric for AI Research Agents

MLGym makes it easy to:

1) Add new tasks
2) Evaluate new models
3) Integrate new agents

Check out a video of the MLGym Agent to see how it performs the full pipeline of idea generation💡, implementation 👩‍💻, experimentation 👩‍🔬, and iteration 🔄 to improve on ML tasks.

Huge thanks to the exceptionally talented Deepak Nathani who led this work and to all the other amazing collaborators who made this possible 🙏🫶🚀

Roberta Raileanu

105,041 views • 1 year ago

OpenAI’s newest model is finally here: o1. o1 represents an entirely new class of models designed to reason or “think through” complex problems— and it's already making huge leaps in domains like math and coding. For the very first episode of YC Decoded, we took a look inside.

Y Combinator

92,680 views • 1 year ago

We got our robots to wash pans, clean windows, make peanut butter sandwiches, and more! Fine-tuning our latest model enables all of these tasks, and this has interesting implications for robotics, Moravec's paradox, and the future of large models in embodied AI. More below!

Physical Intelligence

543,343 views • 7 months ago

We developed an RL method for fine-tuning our models for precise tasks in just a few hours or even minutes. Instead of training the whole model, we add an “RL token” output to π-0.6, our latest model, which is used by a tiny actor and critic to learn quickly with RL.

Physical Intelligence

435,870 views • 4 months ago

GPTs are a new way for anyone to create a tailored version of ChatGPT to be more helpful in their daily life, at specific tasks, at work, or at home — and then share that creation with others. No code required.

OpenAI

2,110,369 views • 2 years ago

What if robots could improve themselves by learning from their own failures in the real-world? Introducing 𝗣𝗟𝗗 (𝗣𝗿𝗼𝗯𝗲, 𝗟𝗲𝗮𝗿𝗻, 𝗗𝗶𝘀𝘁𝗶𝗹𝗹) — a recipe that enables Vision-Language-Action (VLA) models to self-improve for high-precision manipulation tasks. PLD couples real-world residual reinforcement learning with standard supervised fine-tuning — letting robots discover, recover, and distill their own data flywheel. Quick 🧵

Wenli Xiao

185,017 views • 8 months ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,456 views • 1 year ago

We introduce a system for fine-grained robotic manipulation! 🤖 What’s new? * We can control cheap robots to do surprisingly dexterous tasks * New technique that allows robots to learn fine motor skills A short thread 🧵

Chelsea Finn

264,399 views • 3 years ago

Unsloth AI and NVIDIA are Revolutionizing Local LLM Fine-Tuning: From RTX Desktops to DGX Spark

Fine-tune popular AI models faster with Unsloth on NVIDIA RTX AI PCs, from GeForce RTX desktops and laptops to RTX PRO workstations and the new DGX Spark, to build personalized assistants for coding, creative work, and complex agentic workflows.

The landscape of modern AI is shifting. We are moving away from a total reliance on massive, generalized cloud models and entering the era of local, agentic AI. Whether it is tuning a chatbot to handle hyper-specific product support or building a personal assistant that manages intricate schedules, the potential for generative AI on local hardware is boundless.

However, developers face a persistent bottleneck: How do you get a Small Language Model (SLM) to punch above its weight class and respond with high accuracy for specialized tasks? The answer is Fine-Tuning, and the tool of choice is Unsloth. Unsloth provides an easy and high-speed method to customize models. Optimized for efficient, low-memory training on NVIDIA GPUs, Unsloth scales effortlessly from GeForce RTX desktops and laptops all the way to the DGX Spark, the world’s smallest AI supercomputer...

Full analysis: NVIDIA NVIDIA AI NVIDIA AIDev NVIDIAnewsroom Unsloth AI Unsloth

Marktechpost AI Dev News ⚡

31,551 views • 7 months ago

GPT-5.1 is now live in Augment Code. It's our strongest model yet for complex reasoning tasks, such as identifying and fixing bugs or complex multi-file edits. Rolling out to users now. We’re excited for you to try it!

Augment Code

67,211 views • 8 months ago

A research preview of Operator, an agent that can use its own browser to perform tasks for you.

OpenAI

3,937,419 views • 1 year ago

For the first time in human history, we are teaching a Foundation Model to master the diverse tasks of medicinal chemists, biologists, and computational scientists all in one place. In our latest collaboration with Liquid AI, we are moving away from fragmented, specialized tools toward a single, super-intelligent model.

What surprised me most? This model isn't just performing at reasonable levels: it has started outperforming specialist models across physics-based tasks, imaging, and longitudinal data.

Why this changes everything:

- Synergy over Specialization: Fine-tuning on specific tasks has unlocked unexpected capabilities in synergetic areas, opening a new frontier in multimodal AI research.
- Zero-Shot Potential: We are building a model that can perform out-of-scope tasks, moving us closer to an "AI deity" for drug discovery.
- Quality First: The goal isn't just to bypass regulations to save time; it’s about using these synergies to develop better, more effective drugs.

We are no longer just looking at linear regression or simple text; we are looking at the future of how humanity fights disease.

#LiquidAI #InsilicoMedicine #GenerativeAI #DrugDiscovery #DeepTech #BiotechInnovation

Alex Zhavoronkov, PhD (aka Aleksandrs Zavoronkovs)

10,544 views • 3 months ago

Small Language Models (SLMs) are the future of AI. "Small" (SLM) instead of "Large" (LLM). These small models are highly specialized models with superhuman abilities on specific tasks.

Here are two techniques to build these models:

• Spectrum
• Model Merging

I give you a short introduction in the attached video, but here is a quick summary:

Spectrum helps us identify the most relevant layers to solve one specific task. We can ignore everything else and focus on fine-tuning these layers. Using Spectrum, we can fine-tune models in a heartbeat.

Model Merging combines multiple models into a unique, much better model than any of the individual input models. You can also combine models specialized in different tasks and get a model with multiple abilities.

This is the state of the art of productizing models. It's what Arcee.ai's platform does behind the scenes. Arcee collaborated with me on this post and is sponsoring it.

There are three main steps to produce a model for your particular use case:

1. You create a dataset by uploading your data.
2. You train a model. At this step, Arcee uses Spectrum and Model Merging to produce a highly specialized model for your task.
3. You can deploy that model to any environment you want.

Three important notes:

• Training process is 2x faster and 2x cheaper than regular fine-tuning.
• Resultant models are smaller and have higher accuracy.
• They create these specialized models from open-source models.

Check this site so you can fully appreciate how this works:

If you want to fine-tune an open-source model, consider Arcee's platform. This is the state of the art.
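
To make the model-merging idea in this post concrete, here is a minimal sketch of the simplest merge: a weighted average of parameters from models that share an architecture. It is an assumption-laden illustration, not Arcee's method or any library's API. The function name `linear_merge` and the toy "state dicts" (plain Python dicts of flat float lists) are hypothetical, and production merging tools use more elaborate schemes.

```python
def linear_merge(state_dicts, weights=None):
    """Weighted average of same-named, same-sized parameters (hypothetical helper)."""
    if not state_dicts:
        raise ValueError("need at least one model to merge")
    if weights is None:
        # Default to a uniform average over all input models.
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("one weight per model is required")

    names = state_dicts[0].keys()
    if any(sd.keys() != names for sd in state_dicts):
        raise ValueError("models must share the same parameter names")

    merged = {}
    for name in names:
        tensors = [sd[name] for sd in state_dicts]
        if len({len(t) for t in tensors}) != 1:
            raise ValueError(f"parameter {name!r} differs in size across models")
        # Average element-wise across the models.
        merged[name] = [
            sum(w * v for w, v in zip(weights, values))
            for values in zip(*tensors)
        ]
    return merged


if __name__ == "__main__":
    # Toy stand-ins for two fine-tuned checkpoints of the same base model.
    math_model = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
    code_model = {"layer.weight": [3.0, 5.0], "layer.bias": [2.0]}
    print(linear_merge([math_model, code_model]))
```

Averaging only makes sense when the models were fine-tuned from the same base, so that parameters with the same name play the same role. The checks above guard against mismatched names and sizes but cannot verify that assumption.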

Santiago

164,162 views • 2 years ago

GPT-5.5 is here. It’s our smartest frontier model yet, introducing a new class of intelligence for agentic coding, computer use, knowledge work, and scientific research. Rolling out in ChatGPT and Codex today. API is coming soon.

OpenAI Developers

601,175 views • 3 months ago

Revolutionizing Move Programming with OpenLedger

In this demo, we showcase how Move datasets contributed by data providers to OpenLedger’s datanets are used to fine-tune specialized models with LoRA fine-tuning. As seen in the video, we showcase an example of how builders can deploy a Move-specialized model that powers Co-pilot agents using our no-code model fine-tuning platform.

This is the future of AI and Web3 innovation. Watch this space to see more specialised models and data feeds being built for next generation agents on top of OpenLedger #Move

OpenLedger

61,662 views • 1 year ago

📁 Andrew Ng, cofounder of Coursera, says that agentic workflows let AI work iteratively to produce better outcomes than generating text in one pass. He notes that this approach is more effective for complex tasks such as advising, compliance or coding. He concludes that we should not wait for AGI but use what today’s technology can already deliver in the coming months.

Jon Hernandez

19,292 views • 8 months ago

Introducing NVIDIA Nemotron 3 Ultra. A frontier smart open model built for long-running agents that need to plan, reason, use tools and keep working across complex coding, research and enterprise workflows. Up to 5x faster inference and up to 30% lower cost for agentic tasks. Learn more:

NVIDIA

228,735 views • 1 month ago

Today, we're introducing the new vibe coding experience in Google AI Studio!🔥 We want to make it as easy as possible for you to build AI apps :) Short demo of what's new:

Patrick Loeber

77,772 views • 9 months ago