Introducing OFT, an Optimized Fine-Tuning recipe for VLAs! Fine-tuning OpenVLA w/ OFT, we see:
- 25-50x faster inference ⚡️
- SOTA 97.1% avg SR in LIBERO 💪
- high-freq control w/ 7B model on real bimanual robot
- outperforms π₀, RDT-1B, DiT Policy, MDT, Diffusion Policy, ACT
🧵👇

Moo Jin Kim

2,163 subscribers

84,133 views • 1 year ago • via X (Twitter)

Gaming, Science & Technology, Education

11 comments

Moo Jin Kim • 1 year ago

We study key design decisions when fine-tuning VLAs to novel robots/tasks, exploring different:
- action decoding schemes (autoregressive vs parallel)
- action representations (discrete vs continuous)
- learning objectives (next-token prediction vs L1 regression vs diffusion)
2/9
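
To make the autoregressive vs parallel distinction concrete, here is a minimal sketch that only counts decoder forward passes. It assumes one token per action dimension; the 7-D action and the chunk size of 8 are hypothetical values, not figures from the thread. Pass counts are not the same thing as wall-clock speedups.

```python
def forward_passes_autoregressive(action_dim: int, chunk_size: int) -> int:
    """Autoregressive decoding emits one action token per decoder pass."""
    return action_dim * chunk_size


def forward_passes_parallel(action_dim: int, chunk_size: int) -> int:
    """Parallel decoding fills every action position in a single pass."""
    return 1


# Hypothetical sizes, chosen only for illustration.
ACTION_DIM = 7
CHUNK_SIZE = 8

ar_passes = forward_passes_autoregressive(ACTION_DIM, CHUNK_SIZE)
par_passes = forward_passes_parallel(ACTION_DIM, CHUNK_SIZE)
print(ar_passes, par_passes)  # 56 1
```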

Moo Jin Kim • 1 year ago

OpenVLA originally uses autoregressive decoding, discrete actions, & next-token prediction for learning. We find that fine-tuning OpenVLA w/ OFT—parallel decoding w/ action chunking, continuous actions, and L1 regression—dramatically boosts inference speed + success rate! 3/9
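
A rough sketch of the two action representations and the L1 objective named here, in plain Python. The uniform 256-bin discretization over [-1, 1] is an assumption made for illustration, and the helper names are hypothetical; this is not the project's code.

```python
from typing import List

Chunk = List[List[float]]  # chunk_size rows, each with action_dim values


def l1_loss(pred: Chunk, target: Chunk) -> float:
    """Mean absolute error over every dimension of every action in the chunk."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count


def discretize(value: float, n_bins: int = 256, low: float = -1.0, high: float = 1.0) -> int:
    """Discrete alternative: map a continuous value to a uniform bin index."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)


def undiscretize(index: int, n_bins: int = 256, low: float = -1.0, high: float = 1.0) -> float:
    """Recover the bin centre; the round trip loses up to half a bin width."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width


# A 2-step chunk of 2-D actions, regressed directly with L1.
loss = l1_loss([[0.0, 1.0], [2.0, 3.0]], [[0.5, 1.0], [2.0, 2.0]])
print(loss)  # 0.375

# Quantization error introduced by the discrete representation.
round_trip_error = abs(undiscretize(discretize(0.3)) - 0.3)
print(round_trip_error <= 1.0 / 256)  # True
```

Continuous outputs avoid the quantization error entirely, which is one plausible reason to prefer them when the policy head can regress values directly.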

Moo Jin Kim • 1 year ago

In the LIBERO sim benchmark, OFT improves OpenVLA’s action generation throughput by 26x and avg success from 76.5% to 97.1% (SOTA). 🦾 Shows that just plain old imitation learning w/ a strong base VLA + well-designed fine-tuning recipe can go quite far! 4/9
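
A quick arithmetic check on the LIBERO success rates quoted above, expressed both as an absolute gain and as a reduction in failure rate. Only the two percentages come from the thread; the function names are illustrative.

```python
def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Gain in percentage points."""
    return after_pct - before_pct


def failure_reduction_factor(before_pct: float, after_pct: float) -> float:
    """How many times smaller the failure rate becomes."""
    return (100.0 - before_pct) / (100.0 - after_pct)


gain = absolute_gain(76.5, 97.1)
factor = failure_reduction_factor(76.5, 97.1)
print(round(gain, 1))    # 20.6 percentage points
print(round(factor, 1))  # 8.1x fewer failures
```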

Moo Jin Kim • 1 year ago

In real ALOHA robot tasks, we add FiLM for better language grounding & call the augmented recipe "OFT+". OFT+ speeds up OpenVLA inference by 43x, helps it outperform fine-tuned VLAs (RDT-1B + pi0) and from-scratch policies (ACT + Diff Policy), & enhances language following. 5/9
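
For readers unfamiliar with FiLM (feature-wise linear modulation), here is a minimal sketch of the general idea: a language embedding is projected to a per-channel scale and shift, which then modulate the visual features. The weights, shapes, and function names are hypothetical, and this is the textbook form of FiLM rather than a description of how OFT+ wires it in.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(vec: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain affine projection: weight @ vec + bias."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def film(features: Matrix, gamma: Vector, beta: Vector) -> Matrix:
    """Per channel c: out[i][c] = gamma[c] * features[i][c] + beta[c]."""
    return [[g * x + b for x, g, b in zip(row, gamma, beta)] for row in features]


# Hypothetical 2-D language embedding and projection weights.
language_embedding = [1.0, -1.0]
gamma = linear(language_embedding, [[1.0, 0.0], [0.0, 0.5]], [1.0, 1.0])  # [2.0, 0.5]
beta = linear(language_embedding, [[0.0, 0.0], [0.0, -1.0]], [0.0, 0.0])  # [0.0, 1.0]

# Two visual patches with two feature channels each.
visual_features = [[1.0, 2.0], [3.0, 4.0]]
print(film(visual_features, gamma, beta))  # [[2.0, 2.0], [6.0, 3.0]]
```

Because the scale and shift depend on the instruction, the same image features are transformed differently for different commands, which is the mechanism by which FiLM can strengthen language conditioning.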

Moo Jin Kim • 1 year ago

The large gains in inference efficiency give us headroom to process additional model inputs. Now with OFT+, OpenVLA can generate 14-D dual-arm robot actions at 78 Hz, even w/ 3 input images (768 total visual patches)! (See the OpenVLA-OFT+ figure for architecture details.) 6/9
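
The numbers in this post can be related with a little arithmetic. The sketch below derives patches per image from the stated totals and shows how action throughput follows from chunk size and per-pass latency. The chunk size of 25 is an assumption for illustration only; the thread does not state it.

```python
def patches_per_image(total_patches: int, num_images: int) -> int:
    """Split the total visual patch count evenly across input images."""
    if total_patches % num_images != 0:
        raise ValueError("patches do not divide evenly across images")
    return total_patches // num_images


def throughput_hz(chunk_size: int, latency_s: float) -> float:
    """Actions per second when one forward pass yields chunk_size actions."""
    return chunk_size / latency_s


def latency_budget_s(chunk_size: int, target_hz: float) -> float:
    """Longest forward pass that still sustains target_hz actions per second."""
    return chunk_size / target_hz


print(patches_per_image(768, 3))  # 256

ASSUMED_CHUNK_SIZE = 25  # hypothetical, not from the thread
budget = latency_budget_s(ASSUMED_CHUNK_SIZE, 78.0)
print(round(budget, 3))  # seconds available per forward pass under this assumption
```

The point of the calculation is that with action chunking, the model itself can run far slower than the control rate, since each pass supplies many actions.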

Moo Jin Kim • 1 year ago

We discovered surprising things in this project & hope you learn from it, too! We open-source our project so that anyone can use the OFT recipe & fine-tuned VLAs. Hope the resources are useful to the community! 🤗 Paper, code, & models below: 👉 👈 7/9

Moo Jin Kim • 1 year ago

Very grateful to @chelseabfinn and @percyliang who provided super helpful advice all throughout this project. Thank you! 🙏 Also, thank you to everyone who used OpenVLA in their own works. We hope that our new fine-tuning recipe is also useful to robot learning folks! 8/9

Moo Jin Kim • 1 year ago

Bonus video: Here's OpenVLA-OFT+ completing tasks and resetting the environment by itself—fully autonomously, via imitation learning only. It executes the forward task (scoop X into bowl) & backward task (pour X into container) in 6 consecutive episodes. (15x video speed) 9/9

Ruihan Yang • 1 year ago

Great work! Btw, the link on the website still seems to point to OpenVLA.

Moo Jin Kim • 1 year ago

@RchalYang Thank you! Fixed.
