Gen2Act: Casting language-conditioned manipulation as human video generation followed by closed-loop policy execution conditioned on the generated video enables solving diverse real-world tasks unseen in the robot dataset! 1/n

Homanga Bharadhwaj @ CVPR

3,036 subscribers

71,132 views • 1 year ago • via X (Twitter)

Education, Science & Technology

10 Comments

Homanga Bharadhwaj • 1 year ago

We opt for generating human videos because we find that the current best video models (e.g. VideoPoet) are already good at generating human videos *zero-shot* given an image of a scene and a language description of a task. This doesn't require any fine-tuning/adaptation! 2/n

Homanga Bharadhwaj • 1 year ago

The video model generalizes well to new scenarios by virtue of web-scale training. The policy also generalizes to tasks beyond those in the robot data, as it has the much simpler job of translating the generated video into actions by following motion cues from the video. 3/n

Homanga Bharadhwaj • 1 year ago

We can also chain Gen2Act for long-horizon activities with multiple tasks by sequentially rolling out video generation and policy execution conditioned on the generated video. 4/n
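The chaining described here can be sketched roughly as below. This is only an illustrative outline, not the authors' implementation: `run_task`, `run_chain`, the `generate_video`, `policy`, and `step` callables, and the fixed `max_steps` budget are all hypothetical names and assumptions introduced for this example. The one point it tries to capture is that each new video is generated from the scene as the previous task left it.

```python
from typing import Any, Callable, Sequence

# All names below are hypothetical placeholders for illustration only.
GenerateVideo = Callable[[Any, str], Any]   # (scene image, instruction) -> generated human video
Policy = Callable[[Any, Any], Any]          # (current observation, generated video) -> action
Step = Callable[[Any, Any], Any]            # (current observation, action) -> next observation


def run_task(obs: Any, instruction: str, generate_video: GenerateVideo,
             policy: Policy, step: Step, max_steps: int = 10) -> Any:
    """Generate one video for the task, then act closed-loop conditioned on it."""
    video = generate_video(obs, instruction)  # generated once per task
    for _ in range(max_steps):                # assumption: fixed step budget per task
        action = policy(obs, video)           # policy sees the latest observation each step
        obs = step(obs, action)
    return obs


def run_chain(obs: Any, instructions: Sequence[str], generate_video: GenerateVideo,
              policy: Policy, step: Step, max_steps: int = 10) -> Any:
    """Roll out tasks in sequence; each video is generated from the latest scene."""
    for instruction in instructions:
        obs = run_task(obs, instruction, generate_video, policy, step, max_steps)
    return obs
```

In a real system the three callables would wrap a video generation model, a learned policy, and the robot interface; here they are left abstract so the control flow is the only thing being shown.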

Homanga Bharadhwaj • 1 year ago

Following prior works, we categorize results with respect to different levels of generalization. Gen2Act achieves non-trivial success rates (30-60%) for even the challenging categories of motion-type and object-type generalization 5/n

Homanga Bharadhwaj • 1 year ago

This was a fun project w/ @debidatta @gupta_abhinav_ @shubhtuls @CarlDoersch @shahdhruv_ @xiao_ted @SeanKirmani @xf1280 @DorsaSadigh @GoogleDeepMind @CMU_Robotics @StanfordAILab More details: Video: n/n

Samarth Sinha • 1 year ago

Congrats Homanga!!

Jason Ma • 1 year ago

Excited to see this out, congrats Homanga!

Paweł Budzianowski • 1 year ago

Great to see the first video-based model employed! This opens up a completely new category of possibilities!

Jay Vakil • 1 year ago

Amazing work @mangahomanga

Rui Chen • 1 year ago

Great work! Human video is a useful and unlimited source for manipulation.

Related Videos

What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. Stanford University #ICRA2026 1/N

Wenlong Huang

105,054 views • 2 months ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,392 views • 1 year ago

World Model meets robot policy! Robbyant's LingBot-VA: unifies video world modeling and robotic policy learning. - A single model generates both future video and the actions to make it real. - Long-term memory enables long-horizon tasks. - Claims significant outperformance over π₀.₅ in real-world tasks. - It's open-source

The Humanoid Hub

17,721 views • 4 months ago

Long-horizon visual goals remain surprisingly hard for robot manipulation. We introduce Act2Goal, a goal-conditioned policy that uses a visual world model to reason about progress toward a goal, and practice it autonomously in the real world.

Jianlan Luo

95,100 views • 5 months ago

Imitation learning has a data scarcity problem. Introducing EgoDex from Apple, the largest and most diverse dataset of dexterous human manipulation to date — 829 hours of egocentric video + paired 3D hand poses across 194 tasks. Now on arxiv: (1/4)

Ryan Hoque

113,849 views • 1 year ago

Excited to announce Tau Robotics (Tau Robotics). We are building a general AI for robots. We start by building millions of robot arms that learn in the real world. In the video, two robot arms are fully autonomous and controlled by a single neural network conditioned on different language instructions (four axes and five axes robot arms). The other two arms are teleoperated. The entire hardware cost in the video is about $1400. The video is at 1.5x speed.

Alexander Koch

437,791 views • 2 years ago

Can robots self-improve by collecting data autonomously 🤖? Introducing SOAR: a system for large-scale autonomous data collection 🚀 and autonomous improvement 📈 of a multi-task language-conditioned policy in diverse scenes without human interventions.

Paul Zhou

47,667 views • 1 year ago

Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video flow matching (next-frame prediction) At inference, TV2TV dynamically alternates between textual thinking and video generation. Model generations below: interleaved text plans and video slices (~1–2s) are co-generated over time, conditioned on a single frame per sport. 📖

Xiaochuang Han

28,925 views • 6 months ago

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

Rhoda AI

68,050 views • 3 months ago

Perceptive Humanoid Parkour (PHP) introduces a modular framework that enables the Unitree G1 humanoid to perform long-horizon, vision-based parkour. - It chains retargeted human motion clips into diverse, long-horizon kinematic reference trajectories. - RL expert policies learn individual skills that are distilled into a depth-conditioned student policy. - The robot autonomously selects the appropriate skill based on the obstacle geometry.

The Humanoid Hub

60,337 views • 3 months ago

Is VideoGen starting to become good enough for robotic manipulation? 🤖 Check out our recent work, RIGVid — Robots Imitating Generated Videos — where we use AI-generated videos as intermediate representations and 6-DoF motion retargeting to guide robots in diverse manipulation tasks: pouring, wiping, mixing, and more. 🔗 Key takeaways: - VideoGen starts to become good enough for robotics - As the field progresses, we are expecting much better results in the coming years - Depending on whether video prediction models take actions or not (VideoGen vs Action-Conditioned Video Prediction), there are different ways to use them. - Controllability & steerability are still issues In the paper, we explore: – How do different video generation models compare for robotic imitation? – Can generated videos replace real videos for imitation? – What causes failure of imitation given high-quality videos? – How does imitating from video compare with other representations (e.g., keypoint constraints like ReKep)? 🎥 Watch the video for (1) AI-generated inputs, (2) robot executions, and (3) the 3D intermediate representation bridging the embodiment gap.

Yunzhu Li

16,531 views • 11 months ago

Introducing an approach to directly ground video generation models to policy execution without needing any action labels! Our approach uses a generic goal-conditioned exploration procedure to learn a policy that works across robots / embodiments!

Yilun Du

21,392 views • 1 year ago

🎤🎤 Excited to introduce COME-robot🤖🤖, Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V. It is the first closed-loop framework utilizing the vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot demonstrates a significant improvement in task success rate (~25%) compared to SOTA methods. Project: Arxiv:

Siyuan Huang

22,291 views • 2 years ago

TidyBot: Personalized Robot Assistance with Large Language Models approach enables fast adaptation and achieves 91.2% accuracy on unseen objects in our benchmark dataset. We also demonstrate our approach on a real-world mobile manipulator called TidyBot, which successfully puts away 85.0% of objects in real-world test scenarios abs: project page: paper page:

AK

325,986 views • 3 years ago

Character Gameplay #21 Overwhelming power in conditioned state made by the diverse changes during the fight. #BLEACHRoS

Bandai Namco US

41,554 views • 1 year ago

Character Gameplay #21 Overwhelming power in conditioned state made by the diverse changes during the fight. #BLEACHRoS

BLEACH Rebirth of Souls

290,184 views • 1 year ago

Introducing Yell At Your Robot (YAY Robot!) 🗣️- a fun collaboration b/w Stanford University and UC Berkeley 🤖 We enable robots to improve on-the-fly from language corrections: robots rapidly adapt in real-time and continuously improve from human verbal feedback. YAY Robot enables long-horizon, dexterous manipulation tasks like preparing trail-mix, packing a ziploc bag, and cleaning dishes:

Lucy Shi

122,774 views • 2 years ago

Goal-conditioned RL (GCRL) is great - unsupervised, can use data (in offline mode), flexibility to define tasks at test time. But can we run GCRL on *language data*?? In our new work we show that language GCRL enables sophisticated test-time reasoning for interactive tasks! 🧵👇

Sergey Levine

18,782 views • 1 year ago

Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: A generalist policy for solving motion planning tasks in the real world 🤖 1/N

Murtaza Dalal

114,538 views • 1 year ago

Vision-Language Foundation model should go to 3D for robotics!🤖 CoRL23 Oral: GNFactor learns Generalizable Neural Feature Fields for language conditioned manipulation on diverse scenes. It unifies 3D➕Stable Diffusion features using generalizable NeRFs.

Xiaolong Wang

56,268 views • 2 years ago