Introducing OFT, an Optimized Fine-Tuning recipe for VLAs! Fine-tuning OpenVLA w/ OFT, we see:
- 25-50x faster inference ⚡️
- SOTA 97.1% avg success rate in LIBERO 💪
- high-freq control w/ 7B model on real bimanual robot
- outperforms π₀, RDT-1B, DiT Policy, MDT, Diffusion Policy, ACT
🧵👇

We study key design decisions when fine-tuning VLAs to novel robots/tasks, exploring different:
- action decoding schemes (autoregressive vs parallel)
- action representations (discrete vs continuous)
- learning objectives (next-token prediction vs L1 regression vs diffusion)
2/9

OpenVLA originally uses autoregressive decoding, discrete actions, & next-token prediction for learning. We find that fine-tuning OpenVLA w/ OFT—parallel decoding w/ action chunking, continuous actions, and L1 regression—dramatically boosts inference speed + success rate! 3/9
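
To make the discrete-vs-continuous contrast concrete, here is a minimal sketch in plain Python. It is not the project's code: the 256-bin uniform discretization over [-1, 1], the helper names (`to_bin`, `from_bin`, `l1_loss`), and the toy chunk shape are all assumptions for illustration. It shows that binning an action loses up to half a bin width, while an L1 objective compares continuous predictions to targets directly across a whole action chunk.

```python
from typing import List

# A chunk is K timesteps, each a D-dimensional continuous action (illustrative type).
Chunk = List[List[float]]


def to_bin(a: float, n_bins: int = 256, low: float = -1.0, high: float = 1.0) -> int:
    """Map a continuous action value to a discrete bin index (assumed uniform bins)."""
    a = min(max(a, low), high)  # clip into the supported range
    idx = int((a - low) / (high - low) * n_bins)
    return min(idx, n_bins - 1)  # the upper edge falls into the last bin


def from_bin(idx: int, n_bins: int = 256, low: float = -1.0, high: float = 1.0) -> float:
    """Map a bin index back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (idx + 0.5) * width


def l1_loss(pred: Chunk, target: Chunk) -> float:
    """Mean absolute error over every timestep and action dimension in a chunk."""
    total, count = 0.0, 0
    for p_step, t_step in zip(pred, target):
        for p, t in zip(p_step, t_step):
            total += abs(p - t)
            count += 1
    return total / count


if __name__ == "__main__":
    value = 0.3
    recovered = from_bin(to_bin(value))
    print("quantization error:", abs(recovered - value))

    pred = [[0.0, 1.0]]    # one timestep, two action dimensions
    target = [[0.5, 0.5]]
    print("L1 loss:", l1_loss(pred, target))
```

With parallel decoding, all K x D values in the chunk would come out of a single forward pass, so a loss like `l1_loss` can be applied to the whole chunk at once rather than token by token.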

In the LIBERO sim benchmark, OFT improves OpenVLA’s action generation throughput by 26x and avg success from 76.5% to 97.1% (SOTA). 🦾 Shows that just plain old imitation learning w/ a strong base VLA + well-designed fine-tuning recipe can go quite far! 4/9

In real ALOHA robot tasks, we add FiLM for better language grounding & call the augmented recipe "OFT+". OFT+ speeds up OpenVLA inference by 43x, helps it outperform fine-tuned VLAs (RDT-1B + π₀) and from-scratch policies (ACT + Diffusion Policy), & enhances language following. 5/9
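
For readers unfamiliar with FiLM (feature-wise linear modulation), the sketch below shows the general idea in plain Python: a language embedding is projected to a per-channel scale and shift, which are then applied to every visual patch feature. The function names, the toy projection weights, and the exact `gamma * x + beta` form are assumptions for illustration; the thread does not say where or how OFT+ applies the modulation.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weights: Matrix, bias: Vector) -> Vector:
    """Plain affine projection: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def film(features: Matrix, gamma: Vector, beta: Vector) -> Matrix:
    """Scale and shift each feature channel; the same (gamma, beta) is shared by all patches."""
    return [[g * x + b for x, g, b in zip(patch, gamma, beta)] for patch in features]


if __name__ == "__main__":
    # Hypothetical 2-D language embedding and 2-channel patch features.
    lang = [1.0, 0.0]
    w_gamma, b_gamma = [[2.0, 0.0], [0.5, 0.0]], [0.0, 0.0]
    w_beta, b_beta = [[0.0, 0.0], [1.0, 0.0]], [0.0, 0.0]

    gamma = linear(lang, w_gamma, b_gamma)  # per-channel scale from language
    beta = linear(lang, w_beta, b_beta)     # per-channel shift from language

    patches = [[1.0, 2.0], [3.0, 4.0]]
    print(film(patches, gamma, beta))
```

Because the scale and shift depend only on the instruction, changing the language input changes how every visual feature is weighted, which is one plausible reason this kind of conditioning helps language following.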

The large gains in inference efficiency give us headroom to process additional model inputs. Now with OFT+, OpenVLA can generate 14-D dual-arm robot actions at 78 Hz, even w/ 3 input images (768 total visual patches)! (See the OpenVLA-OFT+ figure for architecture details.) 6/9
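
Some back-of-the-envelope arithmetic on these numbers, as a hedged sketch. The 768 patches, 3 images, 14 action dimensions, and 78 Hz come from the post above; the chunk size `K = 25` is a hypothetical value chosen only for illustration, as is the assumption that 78 Hz counts individual actions rather than forward passes.

```python
def patches_per_image(total_patches: int, n_images: int) -> int:
    """Visual patches contributed by each input image, assuming an even split."""
    return total_patches // n_images


def decoding_passes(chunk_size: int, action_dim: int, parallel: bool) -> int:
    """Sequential decoder passes needed to emit one action chunk.

    Autoregressive decoding emits one action token per pass;
    parallel decoding emits the whole chunk in a single pass.
    """
    return 1 if parallel else chunk_size * action_dim


def chunk_latency_budget_s(actions_per_second: float, chunk_size: int) -> float:
    """Time available per forward pass if each pass yields `chunk_size` actions."""
    return chunk_size / actions_per_second


if __name__ == "__main__":
    K = 25  # hypothetical chunk size, not stated in the thread
    D = 14  # dual-arm action dimensionality from the post

    print("patches per image:", patches_per_image(768, 3))
    print("autoregressive passes per chunk:", decoding_passes(K, D, parallel=False))
    print("parallel passes per chunk:", decoding_passes(K, D, parallel=True))
    print("seconds per pass at 78 Hz:", chunk_latency_budget_s(78.0, K))
```

Under these assumptions, the pass count is where parallel decoding with chunking saves the most: one pass replaces K x D sequential ones, which leaves room to spend compute on extra input images.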

We discovered surprising things in this project & hope you learn from it, too! We open-source our project so that anyone can use the OFT recipe & fine-tuned VLAs. Hope the resources are useful to the community! 🤗 Paper, code, & models below: 👉 👈 7/9

Very grateful to @chelseabfinn and @percyliang who provided super helpful advice all throughout this project. Thank you! 🙏 Also, thank you to everyone who used OpenVLA in their own works. We hope that our new fine-tuning recipe is also useful to robot learning folks! 8/9

Bonus video: Here's OpenVLA-OFT+ completing tasks and resetting the environment by itself, fully autonomously, via imitation learning only. It executes the forward task (scoop X into bowl) & backward task (pour X into container) in 6 consecutive episodes. (15x video speed) 9/9

Great work! Btw, the link on the website still seems to point to OpenVLA.

@RchalYang Thank you! Fixed.

