✨ Introducing OpenVLA, an open-source vision-language-action model for robotics!
- SOTA generalist policy
- 7B params
- outperforms Octo, RT-2-X on zero-shot evals 🦾
- trained on 970k episodes from OpenX dataset 🤖
- fully open: model/code/data all online 🤗
🧵👇

226,922 views • 2 years ago • via X (Twitter)

11 comments

Moo Jin Kim • 2 years ago

🧵[2/9] OpenVLA generalizes better overall and shows stronger language grounding than prior SOTA generalist models (RT-1-X, Octo, and even closed-source RT-2-X) across a suite of 17 WidowX robot tasks + 12 Google robot tasks.

Moo Jin Kim • 2 years ago

🧵[3/9] OpenVLA can also be fully fine-tuned on new robot setups/tasks with just 10-150 demos and outperform from-scratch Diffusion Policy on diverse multi-instruction tasks with distractor objects in the scene.

Moo Jin Kim • 2 years ago

🧵[4/9] Additionally, OpenVLA can be fine-tuned via PEFT (LoRA) on a single 48GB GPU, training only 1.4% of the parameters but still matching full fine-tuning performance on Franka Panda fine-tuning tasks.
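To get a feel for why a LoRA adapter ends up training only around one percent of a 7B model, here is a rough back-of-the-envelope sketch. It assumes the publicly documented Llama 2 7B shapes (hidden size 4096, MLP size 11008, 32 blocks, 32k vocabulary), an illustrative rank of 32, and adapters on every linear layer of the language backbone only. None of these choices is stated in the thread, and the vision encoder and projector are ignored, so the result lands near the 1.4% figure without reproducing it.

```python
# Rough LoRA parameter count for a Llama-2-7B-shaped backbone.
# Assumptions (not from the thread): rank 32, adapters on all attention
# and MLP linear layers, vision encoder and projector ignored.

HIDDEN = 4096   # Llama 2 7B hidden size
MLP = 11008     # Llama 2 7B MLP intermediate size
LAYERS = 32     # number of transformer blocks
VOCAB = 32000   # tokenizer vocabulary size


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A LoRA adapter adds two small matrices: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


def block_linear_shapes() -> list[tuple[int, int]]:
    """(d_in, d_out) for the seven linear layers in one transformer block."""
    attention = [(HIDDEN, HIDDEN)] * 4                     # q, k, v, o
    mlp = [(HIDDEN, MLP), (HIDDEN, MLP), (MLP, HIDDEN)]    # gate, up, down
    return attention + mlp


def base_params() -> int:
    """Frozen backbone parameters: blocks, embeddings, output head, norms."""
    per_block = sum(i * o for i, o in block_linear_shapes()) + 2 * HIDDEN
    return LAYERS * per_block + 2 * VOCAB * HIDDEN + HIDDEN


def total_lora_params(rank: int) -> int:
    """Trainable parameters when every block linear layer gets an adapter."""
    per_block = sum(lora_params(i, o, rank) for i, o in block_linear_shapes())
    return LAYERS * per_block


if __name__ == "__main__":
    trainable = total_lora_params(rank=32)
    frozen = base_params()
    print(f"{trainable:,} trainable vs {frozen:,} frozen = {trainable / frozen:.2%}")
```

Because the adapter size grows linearly with the rank, halving the rank in this sketch halves the trainable fraction.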

Moo Jin Kim • 2 years ago

🧵[5/9] Further, by using 4-bit quantization at inference time, the OpenVLA model can be loaded with less than half the GPU memory normally required and complete BridgeData V2 WidowX tasks without compromising performance.
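The memory saving follows from simple arithmetic on weight storage, which the sketch below illustrates. It is a simplification under stated assumptions: it counts weight bytes only (no activations or quantization constants), treats the model as exactly 7 billion parameters, and uses a toy symmetric 4-bit scheme with a single scale. The thread does not say which quantizer was actually used, so `quantize_4bit` is a hypothetical stand-in.

```python
def weight_memory_gb(n_params: int, bits_per_param: int) -> float:
    """Weight storage only, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


def quantize_4bit(values: list[float]) -> tuple[list[int], float]:
    """Toy symmetric quantization: one shared scale, integer codes in [-7, 7]."""
    largest = max(abs(v) for v in values)
    scale = largest / 7 if largest > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_4bit(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    n = 7_000_000_000  # assumed parameter count
    print(weight_memory_gb(n, 16), "GB of weights at 16-bit")
    print(weight_memory_gb(n, 4), "GB of weights at 4-bit")

    weights = [0.7, -0.35, 0.1, 0.0]
    codes, scale = quantize_4bit(weights)
    print(codes, dequantize_4bit(codes, scale))
```

On weights alone the 4-bit footprint is a quarter of the 16-bit one. Real deployments also hold activations and per-block scaling constants, which may be why the thread claims the looser "less than half".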

Moo Jin Kim • 2 years ago

🧵[6/9] How does OpenVLA work? TL;DR: We take a 7B-parameter Prismatic VLM, with a fused DinoV2-SigLIP vision encoder and a Llama 2 LLM backbone, and fine-tune it on a ton of robot action data:
- nearly 1M robot episodes
- almost 30 robotic manipulation datasets
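As a rough illustration of what a "fused" two-encoder vision front end could look like, the toy sketch below concatenates per-patch features from two encoders along the channel axis, projects them to the language model's embedding size, and places them in front of the text embeddings. This wiring is an assumption made for illustration; the thread does not describe the fusion mechanism, and the function names, shapes, and numbers are all hypothetical.

```python
Vector = list[float]


def fuse_patch_features(feats_a: list[Vector], feats_b: list[Vector]) -> list[Vector]:
    """Concatenate per-patch features from two encoders along the channel axis."""
    if len(feats_a) != len(feats_b):
        raise ValueError("both encoders must return the same number of patches")
    return [a + b for a, b in zip(feats_a, feats_b)]


def project(patches: list[Vector], weight: list[Vector]) -> list[Vector]:
    """Linear map into the LLM embedding space; weight has one row per output dim."""
    return [[sum(x * w for x, w in zip(patch, row)) for row in weight]
            for patch in patches]


def build_llm_input(image_tokens: list[Vector], text_tokens: list[Vector]) -> list[Vector]:
    """Assumed layout: projected image patches first, then text embeddings."""
    return image_tokens + text_tokens


if __name__ == "__main__":
    # Two patches; encoder A yields 2 channels, encoder B yields 1 (toy sizes).
    feats_a = [[1.0, 2.0], [3.0, 4.0]]
    feats_b = [[5.0], [6.0]]
    fused = fuse_patch_features(feats_a, feats_b)      # 2 patches x 3 channels
    weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]        # 3 channels -> 2 dims
    image_tokens = project(fused, weight)
    sequence = build_llm_input(image_tokens, [[0.0, 0.0]])
    print(len(sequence), "tokens of size", len(sequence[0]))
```

The language model then sees a single sequence of embeddings, which is what lets a pretrained VLM be fine-tuned on robot data without changing its backbone.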

Moo Jin Kim • 2 years ago

🧵[7/9] Unlike the prior SOTA VLA model RT-2-X, we open-source our model, training & inference code, and OpenX training data mixture! 🤗 See all this and more info at our website! 👉 👈

Moo Jin Kim • 2 years ago

🧵[8/9] OpenVLA is the *first* open-source VLM-based robotic foundation model trained on large-scale real-world robot manipulation data. We hope that our model and training frameworks are useful resources for the robot learning community and help advance embodied AI research!

Moo Jin Kim • 2 years ago

🧵[9/9] Huge thanks to project co-leads, @KarlPertsch and @siddkaramcheti, for making this project possible! ❤️ Also, so grateful for all my collaborators from @Stanford, @UCBerkeley, @MIT, @ToyotaResearch, @GoogleDeepMind, and @physical_int.

Chuang Gan • 2 years ago

Very impressive work! You might not realize that we have an open-source 3D-VLA, published at ICML this year 😀. Code:

Karol Hausman • 2 years ago

Very cool, congrats!

Moo Jin Kim • 2 years ago

@hausman_k Thank you!!
