How can we help any image-input policy generalize better? 👉 Meet PEEK 🤖 — a framework that uses VLMs to decide where to look and what to do, so downstream policies — from ACT, 3D-DA, or even π₀ — generalize more effectively! 🧵

Jesse Zhang @ RSS 2026 ✈️

1,755 subscribers

13,754 views • 9 months ago • via X (Twitter)

Related videos

New research from BlackForestLabsAI - Unofficial 🥳 Meet Self-Flow: our self-supervised framework for image, audio, video & world models 🤖 Do generative models really need DINO to learn strong representations? We propose teaching them directly via a joint framework instead 🧵

Hila Chefer

68,586 views • 4 months ago

🚨 Without Any Motion Priors, how to make humanoids do versatile parkour jumping🦘, clapping dance🤸, cliff traversal🧗, and box pick-and-move📦 with a unified RL framework? Introduce WoCoCo: 🧗 Whole-body humanoid Control with sequential Contacts 🎯Unified designs for minimal tuning across tasks 🤖Generalize to various high-DoF robots Website:

Wenli Xiao

70,332 views • 2 years ago

New work with Alec Radford and David Duvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text. Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:

Nick Levine

1,225,914 views • 3 months ago

🤖Robots that think ahead and act in real time. LingBot-VA 2.0 — the first embodied-native foundation model. Not fine-tuned from a video generator. Built from scratch for the physical world. ✅ 93.6% success on bimanual tasks ⚡ 150 Hz single-GPU inference 🎯 20 demos to generalize. This is what happens when you stop adapting and start building natively. 🧵👇

Robbyant

886,775 views • 18 days ago

Robotics fundamentally involves understanding the dynamics of how things change in the world in response to action and force. This is impossible to learn from static images; instead, it’s far more effective and more data-efficient to learn from video. Elvis Nava joins us to talk about Mimic Robotic. One of the key findings from mimic-video is that pretraining on webscale video allows robots to learn physics priors; as a result, policies train faster, generalize better, and are capable of more impressive dexterity, versus training on static images or image-language pairs as per a VLM. Watch Episode #81 of RoboPapers with Michael Cho - Rbt/Acc and Chris Paxton to learn more!

RoboPapers

46,190 views • 2 months ago

How can we get VLMs to move their eyes—and reason step-by-step in visually grounded ways? 👀 We introduce ViGoRL, a RL method that anchors reasoning to image regions. 🎯 It outperforms vanilla GRPO and SFT across grounding, spatial tasks, and visual search (86.4% on V*). 👇🧵

Gabriel Sarch

76,548 views • 1 year ago

I built software to cut Docker image sizes by more than 80%. It uses dynamic analysis and has a LOT more room for improvement. The demo here shows an ETL workflow Docker image, cut down from 1.28GB to 193MB (-85%)! If large Docker image sizes are becoming a production problem, I'd be happy to explore how we can help.

omkaar

353,187 views • 1 year ago

If you let VLMs experiment on their own, they can do surprising things! From an image, we let a VLM code a 3D scene from scratch in Blender, and then render to verify/refine in a loop. Even when each step is imperfect, it gets results like this, with zero training.

Angjoo Kanazawa

51,928 views • 6 months ago

Elon Musk: We want to make it possible for anyone interested to move to Mars and help build a new civilization. “We want to ultimately make it so that anyone who wants to move to Mars and help build a new civilization can do so. Anyone out there. How cool would that be? And even if you don't want to do it, maybe you have a son or daughter who wants to do that or a friend who wants to do it. And I think it would be the best adventure that one could possibly do is to go and help build a new civilization on a new planet.” From: SpaceX Update, May 29, 2025

ELON CLIPS

72,179 views • 1 year ago

Nikola Jokic postgame "We are not even close to where we're supposed to be. I think how bad we've played, we're in a good spot." "We need to start thinking what I can do for this team to help, not what the team can do to help me...We should point (the thumb not the finger)."

Harrison Wind

314,329 views • 1 year ago

📢📢 𝐀𝐯𝐚𝐭𝟑𝐫 📢📢 Avat3r creates high-quality 3D head avatars from just a few input images in a single forward pass with a new dynamic 3DGS reconstruction model. Video: Project: Our core idea is to make Gaussian Reconstruction Models animatable. We find that a simple cross-attention to an expression code sequence is already sufficient to model complex facial expressions. We then incorporate position maps from DUSt3R and feature maps from Sapiens to facilitate the prediction task. While DUSt3R's position maps act as a pixel-aligned initialization for the Gaussians' positions, the Sapiens feature maps help the cross-view transformer to match corresponding image tokens in the 4 input images. One major challenge in creating a 3D head avatar from smartphone images comes from inconsistent facial expressions when the subject cannot remain perfectly static during the capture. We eliminate this static requirement by simply showing our model input images with different facial expressions during training. This technique makes our model robust to inconsistent input images later on. Finally, we show that although the model was trained with 4 input images, one can even create a 3D head avatar when only a single image is available. To achieve this, we employ a pre-trained 3D GAN to lift the single image to 3D and then render the 4 input images for our model. This allows us to create 3D head avatars from single images and even highly out-of-distribution examples like AI-generated faces, paintings or statues. Great work by Tobias Kirschstein from his internship at Meta with Javier Romero, Artem Sevastopolsky, and Shunsuke Saito

Matthias Niessner

74,763 views • 1 year ago

We want autistic people, with or without a diagnosis, to take our new survey. It is designed to help you tell us what support you need, so we can understand what works and how more can be put in place:

Autism Action

134,186 views • 2 years ago

🚀 New Paper: Pixel Reasoner 🧠🖼️ How can Vision-Language Models (VLMs) perform chain-of-thought reasoning within the image itself? We introduce Pixel Reasoner, the first open-source framework that enables VLMs to “think in pixel space” through curiosity-driven reinforcement learning. Current VLMs reason only in text — even when grounded in rich images or videos, their logical steps are verbalized in natural language. This restricts their ability to interrogate visual evidence and demonstrate how conclusions are drawn. 🔍 So we ask: What if we could make VLMs "show their work" by reasoning directly in the pixel space? Inspired by GPT-o3’s "think-in-image" ability, we propose a framework where VLMs use interactive visual operations — zoom, select-frame, highlight — to reason through complex visual inputs. To do this, we design a two-stage training process: Instruction tuning with synthesized visual reasoning traces. Reinforcement learning with curiosity-driven reward to balance exploration between pixel and text reasoning ✨ With this, Pixel Reasoner achieves near-SoTA performance on many information-rich multimodal benchmarks: 📊 84% on InfographicsVQA 🧠 84% on V* benchmark 🧩 74% on TallyQA-Complex It also achieves strong accuracy of 68% on MVBench (a video benchmark). Website: Paper: Code: Demo: (coming soon)

Wenhu Chen

82,829 views • 1 year ago

(1/6) #SwissBorg Family 💚🚨 Cyrus SwissBorg told me to do this for YOU 🫶 Is this the MOST effective #tokenomics #utility framework you've seen? This 🧵 is to show a fact-based & proven framework that can 🚀 #CHSB and/or any token to the 🌙

Alex SwissBorg 💚

99,069 views • 3 years ago

#ICCV2025 🤩3D world generation is cool, but it is cooler to play with the worlds using 3D actions 👆💨, and see what happens! — Introducing *WonderPlay*: Now you can create dynamic 3D scenes that respond to your 3D actions from a single image! Web: 🧵1/7

Hong-Xing (Koven) Yu

57,796 views • 1 year ago

📢📢 𝐏𝐞𝐫𝐜𝐇𝐞𝐚𝐝: 𝐏𝐞𝐫𝐜𝐞𝐩𝐭𝐮𝐚𝐥 𝐇𝐞𝐚𝐝 𝐌𝐨𝐝𝐞𝐥 𝐟𝐨𝐫 𝐒𝐢𝐧𝐠𝐥𝐞-𝐈𝐦𝐚𝐠𝐞 𝟑𝐃 𝐇𝐞𝐚𝐝 𝐑𝐞𝐜𝐨𝐧𝐬𝐭𝐫𝐮𝐜𝐭𝐢𝐨𝐧 & 𝐄𝐝𝐢𝐭𝐢𝐧𝐠📢📢 PercHead reconstructs realistic 3D heads from a single image and enables disentangled 3D editing via geometric controls and style inputs from images or text. At its core is a generalized 3D head decoder trained with perceptual supervision from DINOv2 and SAM 2.1. We find that our new perceptual loss formulation improves reconstruction fidelity compared to commonly-used methods such as LPIPS. Our trained reconstruction model is able to generate 3D-consistent heads from a single input image. Even with challenging side-view inputs, the model robustly infers missing regions for a coherent, high-fidelity output. In addition, our architecture seamlessly adapts to downstream tasks: by swapping the encoder, we can transform the model into a disentangled 3D editing pipeline. In this scenario, we can control geometry through - potentially hand-drawn - segmentation maps, and condition style via image or text prompt. We also provide an interactive GUI to enable the exploration of our editing pipeline. 🌍 📽️ Great work by Antonio Oroz and Tobias Kirschstein

Matthias Niessner

18,855 views • 8 months ago

We all will have expectations about how life is supposed to be and how people are supposed to be. But we can’t control the way that others act or whether they meet your expectations. So when we hold on to those expectations and they’re not met we create pain. What we can control is how we respond. We can choose to let go of expectation and open ourselves to something new. 🌱❤️ #DateWithDestiny2025

Tony Robbins

32,161 views • 7 months ago