Introducing Reinforcement-Learned Teachers (RLTs): Transforming how we teach LLMs to reason with reinforcement learning (RL).

Blog: Paper:

Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and constitutes a key step in making student AI systems ultimately acquire reasoning capabilities via distillation and cold-starting. Enter our... RLTs, a new class of models prompted with not only a problem’s question but also its solution, and directly trained to generate clear, step-by-step “explanations” to teach their students.

Remarkably, an RLT with only 7B parameters produces better results when distilling and cold-starting students in competitive and graduate-level reasoning tasks than LLMs that are orders of magnitude larger. RLTs are just as effective when distilling 32B students, much larger than the teacher itself, unlocking a new standard for efficiency in developing reasoning language models with RL.

Code:
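As a rough illustration of the input format described in the post, the sketch below builds a teacher prompt that contains both the question and its known solution, then packages an explanation as a training record for a student that sees only the question. The template wording, the function names, and the record layout are assumptions made for this sketch; they are not taken from the Sakana AI code release.

```python
def build_teacher_prompt(question: str, solution: str) -> str:
    """Hypothetical template: the teacher sees the question AND its solution."""
    return (
        "Question:\n" + question.strip() + "\n\n"
        "Known solution:\n" + solution.strip() + "\n\n"
        "Explain step by step how to reach this solution:"
    )


def build_student_example(question: str, explanation: str, solution: str) -> dict:
    """Hypothetical distillation record: the student sees only the question."""
    return {
        "prompt": "Question:\n" + question.strip() + "\n\nAnswer:",
        "target": explanation.strip() + "\n\nFinal answer: " + solution.strip(),
    }


if __name__ == "__main__":
    q = "What is 12 * 13?"
    s = "156"
    print(build_teacher_prompt(q, s))
    # In practice the explanation would come from the trained teacher model;
    # a hand-written string stands in for it here.
    e = "12 * 13 = 12 * 10 + 12 * 3 = 120 + 36 = 156."
    print(build_student_example(q, e, s))
```

The point of the split is that the solution appears in the teacher's input but only in the student's target, so the student still has to learn to produce it.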

Sakana AI

130,232 subscribers

179,276 views • 1 year ago • via X (Twitter)

Health & Wellness • Science & Technology • Education

9 Comments

iMuffin • 1 year ago

Another banger from Sakana. You guys are dropping fantastic papers every week! I can't keep up.

DALNK ʕ •ᴥ•ʔつ━☆ • 1 year ago

Surprised it took this long for people to find this trick

Zinan Lin • 1 year ago

Great work! Just to share — in our NeurIPS 2024 paper (arXiv’ed a year ago), we also proposed the Learning by Teaching idea and showed how it can enhance LLMs during both training and prompting.

Pehdrew • 1 year ago

🧐

Kabir Shaurya • 1 year ago

Great drop

R✨ • 1 year ago

👍

Xhris57 • 1 year ago

Thank you for sharing. Here’s a distilled summary of the Reinforcement-Learned Teachers (RLTs) concept and its significance:

🧠 Reinforcement-Learned Teachers (RLTs) Overview

🔹 What Are RLTs?
RLTs are a new class of LLM-based teacher models trained via reinforcement learning (RL) to produce high-quality, step-by-step explanations for reasoning tasks. Unlike standard RL-trained solvers, RLTs focus on teaching, not just solving.

🔹 How RLTs Work
• They’re prompted with both a problem and its solution.
• They’re trained to generate interpretable reasoning paths rather than just answers.
• The goal is to make them effective at distilling this reasoning into student models.

🚀 Key Innovations & Results
1. Teaching Over Solving: RLTs don’t just learn to solve tasks; they learn how to explain them clearly for downstream distillation.
2. Efficiency Gains: A 7B parameter RLT can outperform much larger LLMs (like 70B) in teaching effectiveness, particularly in distilling reasoning ability into smaller student models.
3. Cold Start Generalization: RLTs can initialize student models with no prior task exposure, significantly improving cold-start training scenarios.
4. Teacher Smaller Than Student: A 7B RLT can successfully distill into a 32B student, challenging assumptions about size-based hierarchy in knowledge transfer.

📊 Performance Impact
• Superior results in competitive reasoning benchmarks
• Enhanced training of student LLMs on step-by-step reasoning tasks
• Opens up new paths for RL-driven education paradigms in AI development

🧩 Implications
• More interpretable AI: Teaching-based RL improves clarity over black-box solutioning.
• Efficient scaling: Smaller models can teach larger ones, reducing compute needs.
• Better alignment and control: RLTs allow for more structured reasoning supervision via RL.

🔗 Resources
• 📜 Paper
• 🧑‍💻 Code
• 📝 Blog

Let me know if you’d like a diagrammatic summary, use-case extrapolation, or integration suggestion into your own symbolic framework.

Emily • 1 year ago

Good update 👌

Shashank Jain • 1 year ago

Excellent work. Just had one doubt: do we use the teacher feedback along with the reward to fine-tune the student, or is it just the reward?

Related Videos

NeurIPS 2025 Paper: LLMs are Reinforcement Learners 🤯! Surprisingly, we show that LLMs can solve RL tasks without any external component! We introduce Prompted Policy Search (ProPS), an RL method based only on LLMs and in-context learning. [Paper]

Heni Ben Amor

51,248 views • 7 months ago

New Course: Reinforcement Fine-Tuning LLMs with GRPO! Learn to use reinforcement learning to improve your LLM performance in this short course, built in collaboration with Predibase by Rubrik, and taught by Travis Addair, its Co-Founder and CTO, and Arnav Garg, its Senior Engineer and Machine Learning Lead.

Reasoning models have been one of the most important developments in LLMs. Reinforcement Fine-Tuning (RFT) uses rewards to encourage LLMs to find solutions to multi-step reasoning tasks such as solving math problems and debugging code - without needing pre-existing training examples like in traditional supervised fine-tuning.

Group Relative Policy Optimization (GRPO) is a reinforcement fine-tuning algorithm gaining rapid adoption. Developed by the DeepSeek team and used to train the R1 reasoning model, GRPO uses reward functions that you can write in Python to assign rewards to model responses. It’s beneficial for tasks with verifiable outcomes and can work well even with fewer than 100 training examples. It can also significantly improve the reasoning ability of smaller LLMs, making applications faster and more cost effective.

In this course, you’ll take a technical deep dive into RFT with GRPO. You’ll learn to build reward functions that you can use in the GRPO training process to guide an LLM toward better performance on multi-step reasoning tasks. In detail, you’ll:
- Learn when reinforcement fine-tuning is a better fit than supervised fine-tuning, especially for tasks involving multi-step reasoning or limited labeled data.
- Understand how GRPO uses programmable reward functions as a more scalable alternative to the human feedback required for other reinforcement learning algorithms, such as RLHF and DPO.
- Frame the Wordle game as a reinforcement fine-tuning problem and see how an LLM can learn to plan, analyze feedback, and improve its strategy over time.
- Design reward functions that power the reinforcement fine-tuning process.
- Learn techniques for evaluating more subjective tasks, such as rating the quality of a text summary, using an LLM as a judge.
- Understand why reward hacking happens and how to avoid it by adding penalty functions to discourage undesirable behaviors.
- Learn the four key components of the loss calculation in the GRPO algorithm: token probability distribution ratios, advantages, clipping, and KL-divergence.
- Launch reinforcement fine-tuning jobs using Predibase’s hosted training services.

By the end of this course, you’ll be able to build and fine-tune LLMs using reinforcement learning to improve reasoning without relying on large labeled datasets or subjective human feedback. Please sign up here:
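The four loss components named in the course outline can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the course's or DeepSeek's code: the function names, the default `clip_eps=0.2`, and the particular KL estimator are choices made for this sketch, and a real implementation would work on batched tensors of token log-probabilities.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Advantage of each response relative to its own group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_token_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one token, built from the probability ratio."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(token) / pi_old(token)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def kl_penalty(logp_new, logp_ref):
    """Non-negative per-token estimate of KL(pi_new || pi_ref) (assumed form)."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


if __name__ == "__main__":
    # Four sampled answers to one prompt, scored 1.0 if correct and 0.0 if not.
    advantages = group_advantages([1.0, 0.0, 0.0, 1.0])
    print(advantages)
    print(clipped_token_objective(math.log(0.5), math.log(0.25), advantages[0]))
    print(kl_penalty(math.log(0.5), math.log(0.4)))
```

In this sketch the quantity to maximize for each token would be the clipped objective minus a small coefficient times the KL penalty, averaged over tokens and responses.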

Andrew Ng

86,442 views • 1 year ago

Robot AI brains, aka Vision-Language-Action models, cannot adapt to new tasks as easily as LLMs like Gemini, ChatGPT, or Grok. LLMs can adapt quickly with their in-context learning (ICL) capabilities. But can we inject ICL abilities into a pre-trained VLA like pi0? Yes! Introducing RICL (Retraining for In-Context Learning), our Conference on Robot Learning (CoRL) 2025 paper. Our RICL-pi0 model can adapt to unseen objects, novel motions, and new scenes with just ICL and RAG (retrieval-augmented generation). RICL-pi0 also boosts performance on the long-tail of tasks. A quick 1 minute video summary:

Kaustubh Sridhar

52,158 views • 10 months ago

Very Powerful but very simple Prompting technique. Simply ask the LLM to re-read the question - and this significantly boosts LLM reasoning across diverse tasks and model types. 💡 Repeats question input twice in prompt, unlocks latent reasoning potential Original Problem 🤔: Decoder-only LLMs with unidirectional attention struggle with nuanced reasoning tasks due to limited global understanding of input questions. Key Insights from this Paper 💡: • Re-reading (RE2) input enhances reasoning by improving question comprehension • Enables "bidirectional" understanding in unidirectional LLMs • Compatible with existing thought-eliciting prompting methods • Effective across various LLM types and reasoning tasks
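The re-reading idea amounts to placing the question in the prompt twice. A minimal sketch follows; the function name and the exact template wording ("Read the question again:") are assumptions for illustration rather than a verified copy of the paper's prompt.

```python
def re2_prompt(question: str) -> str:
    """Build a prompt that states the question, then repeats it once."""
    q = question.strip()
    return f"Q: {q}\nRead the question again: {q}\nA:"


if __name__ == "__main__":
    print(re2_prompt("If a train travels 60 km in 1.5 hours, what is its average speed?"))
```

Because the change is confined to the prompt string, it can be combined with other prompting methods by appending their instructions after the repeated question.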

Rohan Paul

169,175 views • 1 year ago

Hiring RL Engineer! Started off as a curious project at Lossfunk to push the boundaries of LLMs in social reasoning - we are now building RL environments, data, and benchmarks to simulate more real-world scenarios. If you want to train SoTA RL models over multi-GPUs (H200s/B200s) to unlock next AI frontier, this is for you.

Satpal Singh Rathore

45,915 views • 10 months ago

Reasoning LLMs Guide [Full Video - Unedited] 1 hr talk on reasoning LLMs and how to best use them for different applications. Share with your devs & students. I discuss lots of fun ideas like meta-prompting, LLM-as-a-Judge, use cases, prompting tips, and much more.

elvis

61,139 views • 1 year ago

Our course recommendation of the day is “Post-training of LLMs,” where you’ll learn how to customize pre-trained language models using Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Online Reinforcement Learning (RL). You'll learn when to use each method, how to curate training data, and how to implement them in code to shape model behavior effectively. Enroll here:

DeepLearning.AI

29,369 views • 8 months ago

Last semester, I taught Reinforcement Learning class again at UCLA. Together with my amazing TAs Matthew and Caiyuan, we built a mini-project: MetaDrive Arena 🚗🤖 Students applied what they learned in class, trained RL agents, and competed on a live leaderboard. The results were incredible, with 94 agents, 2K submissions, and 130K matches. We saw tons of creativity, clever ideas, and real progress in learning. We’re now releasing it publicly to support RL education and experimentation. Try it out and train your own agent at 🔗

Bolei Zhou

25,231 views • 2 months ago

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models paper page: github: Recent advancements in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, these models still encounter difficulties when generating images from prompts that demand spatial or common sense reasoning. We propose to equip diffusion models with enhanced reasoning capabilities by using off-the-shelf pretrained large language models (LLMs) in a novel two-stage generation process. First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, an LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages utilize frozen pretrained models without any LLM or diffusion model parameter optimization. We validate the superiority of our design by demonstrating its ability to outperform the base diffusion model in accurately generating images according to prompts that necessitate both language and spatial reasoning. Additionally, our method naturally allows dialog-based scene specification and is able to handle prompts in a language that is not well-supported by the underlying diffusion model.

AK

83,657 views • 2 years ago

o1 is here! Alistair sat down with shyamal at the OpenAI office to discuss how our unique techniques have unlocked new levels of performance. We teach LLMs to mimic human reasoning and benefit dramatically from new models, including o1.

Cosine

23,517 views • 1 year ago

"If you are interested in human-level AI, don't work on LLMs." - Yann LeCun

He recommends 👇
- Abandon generative models, the dominant approach today, because they are not the right path forward. Instead, focus on joint-embedding predictive architectures (JEPAs), which operate in representation space rather than generating outputs directly. Given the intractability of some problems, consider energy-based models instead of generative ones.
- Avoid contrastive methods and favor regularized approaches.
- Abandon reinforcement learning (RL). RL is inefficient and should only be a last resort when dealing with an inaccurate model or cost function.
- If your goal is human-level AI, do not work on large language models (LLMs). The field is oversaturated, with academic researchers competing against industrial-scale resources, making meaningful contributions difficult. Instead, focus on fundamental problems such as efficient training with large-scale data.

Full Video from "DSAI by Dr. Osbert Tay" YT channel (link in comment)

Rohan Paul

26,435 views • 1 year ago

We spoke with Laura Ruis from Cohere For AI and UCL about her paper "Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models" where she demonstrated an interesting gap between retrieval and reasoning queries in LLMs indicating the presence of synthesised procedural knowledge generation.

Machine Learning Street Talk

10,997 views • 1 year ago

Turn any AI Model into Reasoning Model with Deepseek r1 <thinking> Architecture. Models like GPT4o and Sonnet 3.5 are Implementation Models But a new breakthrough with Deepseek can make them a Reasoning model. Here's a step-by-step Explanation: 🧵

CJ Zafir

192,667 views • 1 year ago

💡Divergent thinking💡 is a hallmark of human creativity and problem-solving. 🤖 Can LLMs also do divergent reasoning to generate diverse solutions🤔? Introducing Flow-of-Reasoning (FoR) 🌊, a data-efficient way of training an LLM policy to generate diverse, high-quality reasoning trajectories. Unlike existing RL (like PPO) and planning (like MCTS) methods, which look for the max-reward trajectory (akin to convergent thinking), FoR connects LLM reasoning with the #GFlowNet formulation and enables LLMs to find trajectories in proportion to the reward distribution. 🎬 The demo video illustrates how FoR learns and infers multiple solutions to a ♠️Game24 puzzle. 🎯 Inferring diverse solutions could be useful for robustness, data augmentation, and enhanced model generalization. Project page: Paper: Github:

Lianhui Qin

50,447 views • 2 years ago

How can we get VLMs to move their eyes and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, an RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵

Gabriel Sarch

76,548 views • 1 year ago

Turn any open-source LLM into a reasoning powerhouse! Using reinforcement fine-tuning, you can add reasoning abilities to any LLM, even without a labelled dataset. Step-by-step explanation with code:

Akshay 🚀

50,423 views • 1 year ago

🚀 1/7 We are thrilled to launch LLM360 — pushing the frontier of open-source & transparent LLMs! Starting with Amber (7B) & CrystalCoder (7B), we are releasing brand new pre-trained LLMs with all training code, data, and up to 360 model checkpoints. 🔗

LLM360

329,446 views • 2 years ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: Instruction tuning with synthesized visual reasoning traces. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

It may seem counterintuitive to teach students with limited English proficiency to code. But research shows that learning a programming language has more in common with learning a natural language than you might think.💡

edutopia

34,206 views • 2 years ago