New research from Databricks: LLMs Can Learn to Reason via Off-Policy RL. Optimal Advantage-based Policy Optimization with Lagged Inference policy (OAPL) shows you don’t need strict on-policy training to improve reasoning. It matches or beats Group Relative Policy Optimization (GRPO), stays stable with large policy lag, and uses ~3×...

Databricks AI Research

50,151 subscribers

12,539 views • 4 months ago • via X (Twitter)

Related videos

Announcing native availability of Anthropic Claude 3.7 Sonnet in Databricks across AWS, Azure and GCP! Customers can now securely access Claude’s advanced reasoning, planning and agentic capabilities directly within Databricks, no setup required. This launch marks the beginning of a strategic partnership between Anthropic and Databricks. We’re excited to see what our customers build and how they leverage domain-specific AI agents!

Databricks

12,691 views • 1 year ago

Does off-policy value-based RL scale? In LLMs, larger scale predictably improves performance. Value-based RL learns from arbitrary data and is sample-efficient, but folk wisdom says it doesn't scale 🧵⬇️We show predictability for scaling value-based RL!

Oleg Rybkin

23,979 views • 1 year ago

🤔 How to fine-tune an Imitation Learning policy (e.g., Diffusion Policy, ACT) with RL? As an RL practitioner, I’ve been struggling with this problem for a while. Here’s why it’s tough: 1️⃣ Special designs (usually for multimodal action distributions) in modern IL models make them non-trivial to fine-tune with RL. 2️⃣ Large policy models + RL's poor sample efficiency = a nightmare. But finally, we figured out a simple solution that works for any model architecture! 🌟 Check out our #ICLR2025 paper: “Policy Decorator: Model-Agnostic Online Refinement for Large Policy Models”, led by my amazing mentee Xiu Yuan. 🔗 🧵 Read more below!

Tongzhou Mu 🤖🦾🦿

16,959 views • 1 year ago

Meet #DBRX: a general-purpose LLM that sets a new standard for efficient open source models. Use the DBRX model in your RAG apps or use the DBRX design to build your own custom LLMs and improve the quality of your GenAI applications.

Databricks

327,704 views • 2 years ago

Haven't been to a conference in a while, really excited to be at #NeurIPS2024! I'll be helping present 4 of our group's recent papers: 1. Overcoming the Sim-to-Real Gap: Leveraging Simulation to Learn to Explore for Real-World RL 2. Distributional Successor Features Enable Zero-Shot Policy Optimization 3. Learning to Cooperate with Humans using Generative Agents 4. Personalizing Reinforcement Learning from Human Feedback with Variational Preference Learning Find more details on each paper and where to find us in this thread (1/6)

Abhishek Gupta

10,777 views • 1 year ago

How does high-fidelity tactile simulation help robots nail the last millimeter? We’re releasing VT-Refine, accepted to CoRL: a real-to-sim-to-real visuo-tactile policy using a GPU-parallel tactile sim for our piezoresistive skin FlexiTac. Then fine-tuning a diffusion policy with large-scale RL in simulation. Website: #CoRL2025 #RobotLearning #Sim2Real

Binghao Huang

46,982 views • 8 months ago

Zombie robot RL policy

Simon Kalouche

164,401 views • 1 year ago

New project! Flow Policy Gradients for Robot Control tldr; a simple online RL recipe for training and fine-tuning flow policies for robots co-led w/ Hongsuk Benjamin Choi:

Brent Yi

74,149 views • 4 months ago

🚨Current scalable RL algos train a policy w/o value func, which is limiting with learning in open-ended, non-stationary, dynamic environments. But, how to scale value-based RL with more data/compute is unclear... Not anymore: presenting scaling laws for value-based RL 🧵⬇️

Aviral Kumar

37,301 views • 1 year ago

Netanyahu: My opposition to a Palestinian State is not simply my policy, it is the policy of Israel and its people.

Clash Report

53,350 views • 9 months ago

Join our mission to advance human-centered AI! Stanford HAI is seeking a new Director of Policy and Society to lead our team in producing critical scholarship and convening global discussions on the intersection of technology, policy, and society.

Stanford HAI

32,448 views • 2 years ago

Here is a tutorial on training LLaSA (LLaMA-based TTS) using GRPO to improve prosody, rhythm, and expressiveness in synthesized speech with TRL!

steven

15,488 views • 7 months ago

When this person uploaded a PDF of the "Excise Policy 2024-25" (tax rules in Haryana), Manus created a visual reference guide with 15 mindmaps that break down the complex policy document into easy-to-understand visuals. Each mindmap provides a visual representation of a specific topic from the policy, making complex information simple to understand and reference. Manus even published the results as a website for convenient viewing and sharing!

ManusAI

29,810 views • 1 year ago

ICML 2026: Latent Reasoning in TRMs is Secretly a Policy Improvement Operator. Why does recursive reasoning, especially latent reasoning, actually work? The theory is still young, and even mechanistic explanations are limited. We close part of this gap by showing that latent reasoning is secretly doing policy improvement. Each recursion pushes the model steadily toward the target. Based on this view, we propose an algorithm that boosts learning and inference efficiency by up to 18x.

Arip

24,553 views • 16 days ago

Generalist robot policies need a benchmark that works across any robot and any policy. 🦾 Introducing RoboLab, a high‑fidelity simulation benchmark built on NVIDIA Isaac and Omniverse to evaluate generalist robot policies in diverse, photoreal, physics‑based environments. Coming soon to the NVIDIA Isaac Lab‑Arena roadmap for large‑scale, robotic policy evaluation. 📖 #NationalRoboticsWeek

NVIDIA Robotics

23,872 views • 2 months ago

You asked. We listened. Introducing StockX Returns. With our new policy, you can now return eligible orders up to 14 days after delivery for StockX Credit.

StockX

134,065 views • 1 year ago

The Pope says you can’t vote for Trump because he has a strict immigration policy …do you know who else has a strict immigration policy?

April Color

10,256 views • 1 year ago

“But I wish you could explain to me what the hell’s going on with the mind of the public. Because we have the right policy. [Democrats] don’t. They have horrible policy.” —Trump at the House GOP retreat

Crooked Media

33,122 views • 5 months ago