🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text, even when grounded in...
82,829 views • 1 year ago • via X (Twitter)
7 Comments

We found that standard GRPO does not work well because VLMs tend to ignore these visual operations. We therefore propose a curiosity-driven reward that incentivizes the model to use visual operations properly without overusing them. RaPR is the ratio of rollouts in one group that use visual operations. 1_{PR} indicates whether a specific rollout uses visual operations. H is a threshold. So r_curiosity rewards individual rollouts in groups that have a low visual-operation rate, while r_penalty penalizes overuse of visual operations to prevent reward hacking. This reward design is the key to building Pixel Reasoner.
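As a rough illustration of the reward described above, here is a minimal Python sketch. It is not the authors' implementation: the exact functional forms (a bonus of max(H - RaPR, 0) gated by 1_{PR}, and a penalty of min(max_ops - n_ops, 0)), the additive combination, the weights `alpha` and `beta`, the cap `max_ops`, and the function name `shaped_rewards` are all assumptions chosen to match the verbal description.

```python
from typing import List


def shaped_rewards(
    base_rewards: List[float],
    num_visual_ops: List[int],
    H: float = 0.3,        # threshold on RaPR (value is an assumption)
    max_ops: int = 1,      # hypothetical cap on visual operations per rollout
    alpha: float = 0.5,    # hypothetical weight on the curiosity bonus
    beta: float = 0.05,    # hypothetical weight on the overuse penalty
) -> List[float]:
    """Add a curiosity bonus and an overuse penalty to one group's rewards."""
    if len(base_rewards) != len(num_visual_ops):
        raise ValueError("one visual-op count is needed per rollout")
    if not base_rewards:
        return []

    # 1_{PR}: 1.0 if the rollout used at least one visual operation.
    uses_ops = [1.0 if n > 0 else 0.0 for n in num_visual_ops]
    # RaPR: fraction of rollouts in this group that used visual operations.
    rapr = sum(uses_ops) / len(uses_ops)

    shaped = []
    for reward, n_ops, indicator in zip(base_rewards, num_visual_ops, uses_ops):
        # Bonus only when the group's usage rate is below H, and only for
        # rollouts that actually used visual operations (assumed form).
        r_curiosity = max(H - rapr, 0.0) * indicator
        # Non-positive term that grows with operations beyond the cap
        # (assumed form).
        r_penalty = min(max_ops - n_ops, 0)
        shaped.append(reward + alpha * r_curiosity + beta * r_penalty)
    return shaped


if __name__ == "__main__":
    # Group of four rollouts; only the third uses visual operations.
    print(shaped_rewards([1.0, 0.0, 1.0, 0.0], [0, 0, 3, 0], H=0.5))
```

Under these assumptions the bonus fades out as more rollouts in a group adopt visual operations (RaPR approaches H), so it mainly acts early in training, while the penalty keeps a rollout from collecting the bonus by issuing many operations.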

Great work led by Alex Su and Haozhe Wang, in collaboration with HKUST and USTC.

Can Machine Learning beat the market? Check out this post on my free Substack where I share code and commentary for an XGBoost model and a Random Forest model that both deliver powerful performances.

Very cool work! We are also exploring reasoning with images, but through image generation as imagination. If you are interested, feel free to take a look!

Congratulations! Keep it up!

Wow. Here's another o3 inspired work:

Very cool!

