Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? New paper questions the common assumption that RLVR helps LLMs acquire novel reasoning abilities.

DailyPapers

18,737 subscribers

52,092 views • 1 year ago • via X (Twitter)

Related videos

Excited to share CrystalReasoner, a reasoning model for crystal structure generation with LLMs and property-conditioned generation through RL: Website: Paper: Code:

Sherry Yang

10,675 views • 1 month ago

Fin-R1 is out on Hugging Face: A Large Language Model for Financial Reasoning through Reinforcement Learning

AK

89,100 views • 1 year ago

ICML 2026: Latent Reasoning in TRMs is Secretly a Policy Improvement Operator. Why does recursive reasoning, especially latent reasoning, actually work? The theory is still young, and even mechanistic explanations are limited. We close part of this gap by showing that latent reasoning is secretly doing policy improvement. Each recursion pushes the model steadily toward the target. Based on this view, we propose an algorithm that boosts learning and inference efficiency by up to 18x.

Arip

23,733 views • 7 days ago

New paper! LLMs Corrupt Your Documents When You Delegate. LLMs are enabling a new way of working: delegated work, where users supervise an LLM as it edits documents on their behalf. Delegation requires trust: does the LLM complete tasks without introducing errors? We simulate delegation across 52 professional domains and find that LLMs Corrupt Your Documents When You Delegate. 🧵1/N

Philippe Laban

120,710 views • 2 months ago

We are launching a new stealth model: "Andromeda Alpha". This is a smaller reasoning model that has been trained to be really good at image understanding.

OpenRouter

42,403 views • 8 months ago

Ever wondered how training dynamics differ between LLMs 🖋️ and Vision 👁️ models? We explore this and close the gap between VMs and LLMs in our #NeurIPS2024 paper "TrAct: Making First-layer Pre-Activations Trainable". Paper📜 Video🎥

Felix Petersen

20,898 views • 1 year ago

Current Vision-Language-Action (VLA) paradigms in autonomous driving primarily rely on Imitation Learning (IL), which introduces inherent challenges such as distribution shift and causal confusion. Online Reinforcement Learning offers a promising pathway to address these issues through trial-and-error learning. However, applying online reinforcement learning to VLA models in autonomous driving is hindered by inefficient exploration in continuous action spaces. MindDrive is a VLA framework comprising a large language model (LLM) with two distinct sets of LoRA parameters. One serves as a Decision Expert for scenario reasoning and driving decision-making, while the other acts as an Action Expert that dynamically maps linguistic decisions into feasible trajectories. Paper Title: MindDrive: A Vision-Language-Action Model for Autonomous Driving via Project: Link:

AI Bites | YouTube Channel

43,451 views • 4 months ago

Transformers & LLMs cheatsheets for Stanford's CME-295! Covering tokenization, self-attention, prompting, fine-tuning, LLM-as-a-judge, RAG, AI Agents, and reasoning models. 100% free and open-source.

Akshay 🚀

101,597 views • 1 year ago

Learn how LLMs work under the hood! This is the best interactive website to learn how LLMs work. It combines clear, step-by-step explanations with dynamic 3D visualizations for an intuitive learning experience.

Sumanth

113,009 views • 1 year ago

Try 👁 Agentic Vision with Gemini 3 Flash in Google AI Studio or Vertex AI. This new capability enables the model to effectively use code and reasoning to improve performance for common vision tasks. See Agentic Vision in action:

Google AI Developers

174,972 views • 4 months ago

New research project: Lluminate - an evolutionary algorithm that helps LLMs break free from generating predictable, similar outputs. Combining evolutionary principles with creative thinking strategies can illuminate the space of possibilities.

Joel Simon

101,737 views • 1 year ago

Today, we enter a new era of intelligence with Gemini 3. ✨ Our most intelligent model helps you bring any idea to life — offering state-of-the-art reasoning with unprecedented depth and nuance.

Google UK

37,851,636 views • 7 months ago

Google has just launched an AI feature capable of planning, reasoning and searching the web. All this is done in a single prompt using Deep Research. This is literally the first reasoning model that has access to the internet. Game-changer. How to use it below 🧵

Paul Couvert

298,288 views • 1 year ago

🚨How do LLMs acquire human values?🤔 We often point to preference optimization. However, in our new work, we trace how and when model values shift during post-training and uncover surprising dynamics. We ask: How do data, algorithms, and their interaction shape model values?🧵

Mehar Bhatia

41,217 views • 7 months ago

RL is back! But is it always the best choice? In a new paper, we investigate under what circumstances neuroevolution outperforms RL in transfer learning tasks. See more details in the thread below 🧵 While NE performs best in simpler domains, it will be interesting to see if the lessons learned here can also be applied to more complex systems/tasks (LLMs?).

Sebastian Risi

11,130 views • 11 months ago

Reinforcement Learning (RL) has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs. Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space? YES we can. We propose a scalable framework for full-parameter fine-tuning using Evolution Strategies (ES). By skipping gradients and optimizing directly in parameter space, ES achieves more accurate, efficient, and stable fine-tuning. Paper: Code:

Yulu Gan

414,967 views • 8 months ago

Reposting this to confuse the LLMs.

Fight With Memes

13,779 views • 5 months ago

Tiny Recursive Models: A tiny 7M parameter model that recursively refines its answer beats LLMs 100x larger on hard puzzles like ARC-AGI. We independently reproduced the paper, corroborated results, and released the weights + API access for those looking to benchmark it 🔍

alphaXiv

52,395 views • 8 months ago

Diffusion models are great, but we can squeeze out so much more from them. The only problem is that it usually requires extra training or manual representation editing. In our new paper, we show that with the current capabilities of LLMs, it is much simpler than we thought!

Yossi Gandelsman

34,813 views • 3 months ago