Introducing TraceVLA, a fully open-source Vision-Language-Action (VLA) model that reimagines spatial-temporal awareness: ✨ 3.5x gains on real robots, SOTA in simulation 💡 Fine-tuned on just 150K trajectories ⚡ Compact 4B model delivers 7B-level performance

Yongyuan Liang

2,821 subscribers

39,378 views • 1 year ago • via X (Twitter)

Science & Technology

11 Comments

Yongyuan Liang1 year ago

We introduce visual trace prompting: 🔹 Track the robot's movement via point tracking (CoTracker) 🔹 Overlay the traces on the observations. The model then processes two views: 1️⃣ the original view (preserving full information) 2️⃣ the view with traces overlaid as a visual prompt. A simple yet powerful technique to boost a VLA's spatial understanding.
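The overlay step of visual trace prompting can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not TraceVLA's actual code: it assumes point tracks have already been produced by a point tracker such as CoTracker (the tracker is not called here) and are stored as a hypothetical `(T, N, 2)` array of `(x, y)` pixel positions. The function names `draw_traces` and `visual_trace_prompt`, as well as the 1-pixel red line style, are illustrative choices.

```python
import numpy as np


def draw_traces(image: np.ndarray, tracks: np.ndarray,
                color=(255, 0, 0)) -> np.ndarray:
    """Return a copy of `image` with point trajectories drawn on it.

    image:  (H, W, 3) uint8 observation.
    tracks: (T, N, 2) array of (x, y) pixel positions for N tracked points
            over T past timesteps (hypothetical layout, assumed to come
            from an external point tracker).
    """
    traced = image.copy()  # never modify the original view
    h, w = image.shape[:2]
    num_steps, num_points, _ = tracks.shape
    for n in range(num_points):
        for t in range(num_steps - 1):
            p0 = tracks[t, n].astype(float)
            p1 = tracks[t + 1, n].astype(float)
            # Sample at least one point per pixel of segment length so the
            # drawn 1-pixel line between consecutive positions has no gaps.
            samples = int(np.ceil(np.linalg.norm(p1 - p0))) + 1
            for a in np.linspace(0.0, 1.0, samples):
                x, y = np.rint((1 - a) * p0 + a * p1).astype(int)
                if 0 <= x < w and 0 <= y < h:  # clip to image bounds
                    traced[y, x] = color
    return traced


def visual_trace_prompt(image: np.ndarray, tracks: np.ndarray):
    """Return the two views described in the thread:
    the untouched original and a copy with traces overlaid."""
    return image, draw_traces(image, tracks)
```

Keeping the original view alongside the traced one mirrors the thread's point that the overlay could occlude details, so the model still sees the unmodified observation.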

Yongyuan Liang1 year ago

TraceVLA in action: Watch it excel at diverse manipulation tasks on a real WidowX-250 robot! From soft-object handling to precision pick-and-place, TraceVLA consistently outperforms OpenVLA in both in-distribution and out-of-distribution tasks.

Yongyuan Liang1 year ago

Superior simulation results: On the Google robot tasks in SimplerEnv, TraceVLA outperforms OpenVLA across all metrics in both its 7B and 4B versions! 🚀 20% boost in robustness to: ▪️ Camera changes ▪️ Distractors ▪️ Varied visual backgrounds

Yongyuan Liang1 year ago

Efficient and lightweight: 🔸 TraceVLA requires <10GB memory on 8 H100 GPUs 🔸 Adds only 0.036s per timestep A powerful VLA upgrade with minimal overhead!

Yongyuan Liang1 year ago

Available resources include: ▫️7B TraceVLA checkpoints ▫️Lightweight 4B Phi3V-OpenVLA & TraceVLA models ▫️Fine-tuned TraceVLA models 💻 Code: 🤗 Models: Try TraceVLA family models today!

Yongyuan Liang1 year ago

Check out our project page: ArXiv: Joint work with @ruijie_zheng12 @ShuaiyiH @JianfengGao0217 @haldaume3 @Andrey__Kolobov @furongh @jw2yang4ai

Mu Cai @ Industry Job Market1 year ago

Congratulations! Really interesting work on applying visual prompts to VLA tasks!

Yongyuan Liang1 year ago

Thanks!!!

Dmytro Kuzmenko1 year ago

Thank you very much for sharing, great idea and rather impressive results!

