Excited to finally share Generative Value Learning (GVL), my Google DeepMind project on extracting universal value functions from long-context VLMs via in-context learning! We discovered a simple method to generate zero-shot and few-shot values for 300+ robot tasks and 50+ datasets using SOTA VLMs like Gemini (Try out the...
98,090 views • 1 year ago • via X (Twitter)
10 Comments

First, check out our project website for the paper, interactive demos, and a way to get your robot video labeled by GVL today! You can even listen to an AI podcast about our paper, or ask Gemini questions about it too! We (especially @xf1280) put a lot of effort into getting these demos up. Let us know how you find these new ways to engage with the paper!

The value function is a fundamental component of robotics; it can be used for search, planning, RL, success detection, and many more applications. However, learning a universal value function (UVF) for many robots and tasks has been extremely challenging, and traditional value learning algorithms have not been shown to scale. In this paper, we explore a totally new direction and ask: Can SOTA VLMs, with all their world knowledge and capabilities, be repurposed as universal value functions for all robots and tasks?

The answer is yes, and the method is simple yet intriguing! We propose formulating value learning as an autoregressive prediction task over a *shuffled* sequence of the input video's frames. Why? Think about a standard video showing a task unfolding in chronological order. We empirically find that this actually makes it harder for the VLM to estimate progress, because it might just latch onto the order of the frames instead of the underlying changes that signify actual progress towards completing the task. By shuffling, we force the VLM to work “harder” to figure out the correct order based on the visual cues of task progress, and doing so significantly improves the faithfulness of the value predictions! In a way, GVL poses value prediction as a “temporal unshuffling” puzzle to the VLM; it has all the pieces, but it has to figure out how those pieces fit together in a way that makes sense based on progress towards a goal.
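For readers who want to see the shuffling idea in code, here is a minimal sketch, not our actual implementation. It assumes a hypothetical `predict_values` callable that stands in for the VLM query (it takes frames in the order given and returns one value per frame); the function name `shuffled_value_query` and the seed handling are also illustrative assumptions.

```python
import random
from typing import Callable, List, Sequence


def shuffled_value_query(
    frames: Sequence[object],
    predict_values: Callable[[Sequence[object]], List[float]],
    seed: int = 0,
) -> List[float]:
    """Query a value predictor on shuffled frames, then restore chronological order.

    `predict_values` is a hypothetical stand-in for the VLM call: it receives
    the frames in shuffled order and returns one value per frame, in that
    same shuffled order.
    """
    rng = random.Random(seed)

    # order[pos] is the chronological index of the frame shown at position pos.
    order = list(range(len(frames)))
    rng.shuffle(order)
    shuffled = [frames[i] for i in order]

    predictions = predict_values(shuffled)
    if len(predictions) != len(frames):
        raise ValueError("expected exactly one predicted value per frame")

    # Undo the shuffle so values[i] belongs to the i-th chronological frame.
    values = [0.0] * len(frames)
    for pos, chronological_index in enumerate(order):
        values[chronological_index] = predictions[pos]
    return values
```

The only bookkeeping that matters is the `order` list: the predictor never sees chronological positions, and the caller still gets values aligned with the original video.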

GVL can zero-shot generate dense values and captions for diverse robots, tasks, and viewpoints! Here, we show some examples of GVL on really long-horizon tasks and challenging viewpoints, including laundry folding from @physical_int, shirt hanging from ALOHA Unleashed (@tonyzzhao @ayzwah), and wrist camera trajectories from DOBBE (@notmahi) and UMI (@chichengcc). Check out our project website for many additional results!

More examples on some OXE datasets and even a navigation video! No modification to the algorithm or fine-tuning of the VLM needed!

What’s very appealing about GVL is that it can leverage in-context learning to improve its value predictions! By simply prepending shuffled frame-value pairs to the VLM context, we find that value prediction quality steadily improves on a challenging set of 250 ALOHA tasks! The long context window enables us to pack as many as 5 trajectories (>150 frames) in-context, and we still see a performance boost!
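A rough sketch of how such a few-shot context could be assembled, under stated assumptions: the helper names (`linear_progress_labels`, `build_few_shot_context`) are hypothetical, and labeling example frames with linear progress from 0 to 1 is an assumption for illustration, not necessarily how the example values are obtained in the paper.

```python
import random
from typing import List, Sequence, Tuple

FrameValue = Tuple[object, float]


def linear_progress_labels(frames: Sequence[object]) -> List[FrameValue]:
    """Assumed labeling: frame i of n gets value i / (n - 1), from 0.0 to 1.0."""
    n = len(frames)
    if n == 1:
        return [(frames[0], 1.0)]
    return [(frame, i / (n - 1)) for i, frame in enumerate(frames)]


def build_few_shot_context(
    example_trajectories: Sequence[Sequence[FrameValue]],
    query_frames: Sequence[object],
    seed: int = 0,
) -> Tuple[List[FrameValue], List[object], List[int]]:
    """Return (context_pairs, shuffled_query, order).

    context_pairs: frame-value pairs, shuffled within each example trajectory,
        to be prepended to the prompt.
    shuffled_query: the query frames in shuffled order (values to be predicted).
    order: order[pos] is the chronological index of shuffled_query[pos].
    """
    rng = random.Random(seed)

    context_pairs: List[FrameValue] = []
    for trajectory in example_trajectories:
        pairs = list(trajectory)
        rng.shuffle(pairs)  # shuffle each example independently
        context_pairs.extend(pairs)

    order = list(range(len(query_frames)))
    rng.shuffle(order)
    shuffled_query = [query_frames[i] for i in order]
    return context_pairs, shuffled_query, order
```

Adding more example trajectories only lengthens `context_pairs`, which is why a long context window is what makes packing several trajectories feasible.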

GVL can even benefit from cross-embodiment and cross-task in-context learning! That is, we can feed shuffled frames of humans or robots performing other tasks, along with their values, as context, and we again see a performance improvement!

The generality of GVL enables many downstream applications, including dataset quality estimation, success detection, and policy learning! I am very excited about the dataset quality estimation results, because this is a new way of using value models and is very relevant to today’s robot learning landscape, where models are trained on mixtures of datasets and practitioners need good ways of determining which datasets are high quality. Check out the paper for more details on these applications!

I'd like to thank all my collaborators for making this a super fun and rewarding project: @JoeyHejna @ayzwah @ChuyuanFu @shahdhruv_ @jackyliang42 @drzhuoxu @SeanKirmani @sippeyxp @DannyDriess @xiao_ted @JonathanTompson @obastani @dineshjayaraman @Stacormed @tingnan1986 @DorsaSadigh @xf1280. Many of them are currently at @corl_conf; make sure to talk to them about our paper! I am particularly grateful to @xf1280 for his mentorship and guidance throughout this project; I benefited a lot from his expertise and insights on frontier VLMs for robotics!

@GoogleDeepMind Great work bro 👍
