Minqi Jiang

@MinqiJiang6,409 subscribers

What if you kept asking an LLM to "make it better"? In some recent work at FAIR, we investigate how to efficiently use RL to fine-tune LLMs to iteratively self-improve on their previous solutions at inference time.

Training for iterated self-improvement can be costly: the naive approach to training for K self-improvement steps requires K times as many rollout steps per episode. We introduce Exploratory Iteration (ExIt), an RL-based automatic curriculum method that bootstraps diverse training distributions of self-improvement tasks by upcycling the LLM's own responses at previous turns as the starting points for both self-improvement and *self-divergence.* To decide which task to train on next, the curriculum prioritizes sampling partial turn histories that led to higher return variance in their GRPO group (a learnability score that comes for free). This automatic curriculum over the bootstrapped task space teaches the model to perform iterated self-improvement while only ever training it on single-step self-improvement tasks.

We look at ExIt's impact in both single-turn (contest math problems) and multi-turn (BFCLv3) settings, as well as on MLE-bench, where the LLM is run in a search scaffold to produce solutions to real Kaggle competitions. Across these eval settings, we find that ExIt produces models with greater capacity for inference-time self-improvement than GRPO. Notably, ExIt models can self-improve on test tasks for many more steps than the typical solution depth encountered during training, including a 22% improvement in MLE-bench performance compared to GRPO.
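
To make the curriculum idea concrete, here is a minimal Python sketch of variance-prioritized sampling over partial turn histories. It is not the ExIt implementation: the `TaskBuffer` class, its method names, the use of population variance as the score, and sampling proportional to that score are all assumptions made for illustration.

```python
import random
import statistics
from dataclasses import dataclass, field


@dataclass
class TaskBuffer:
    """Hypothetical buffer mapping a partial turn history to a learnability score."""

    scores: dict = field(default_factory=dict)

    def update(self, history, group_returns):
        # Assumed score: variance of the returns within one GRPO group
        # rolled out from this starting point. Zero variance means every
        # sample did equally well (or badly), so there is little to learn.
        score = statistics.pvariance(group_returns)
        self.scores[history] = score
        return score

    def sample(self, rng):
        # Assumed rule: sample a starting point with probability
        # proportional to its score; fall back to uniform if all are zero.
        histories = list(self.scores)
        weights = [self.scores[h] for h in histories]
        if sum(weights) == 0:
            return rng.choice(histories)
        return rng.choices(histories, weights=weights, k=1)[0]


if __name__ == "__main__":
    buffer = TaskBuffer()
    # A history is represented here as a tuple: (task, previous responses...).
    buffer.update(("task-1",), [1.0, 1.0, 1.0, 1.0])          # already solved
    buffer.update(("task-1", "draft-A"), [0.0, 1.0, 0.0, 1.0])  # mixed outcomes
    print(buffer.sample(random.Random(0)))
```

In this sketch, a starting point whose group returns are all identical gets a score of zero and is never drawn while any higher-variance starting point exists, which is one simple way to read "prioritizes sampling of partial turn histories that led to higher return variance".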


41,048 views