Gen2Act: Casting language-conditioned manipulation as human video generation followed by closed-loop policy execution conditioned on the generated video enables solving diverse real-world tasks unseen in the robot dataset! 1/n

Homanga Bharadhwaj @ CVPR

3,036 subscribers

71,132 views • 1 year ago • via X (Twitter)

Education, Science and Technology

Comments: 10

Homanga Bharadhwaj · 1 year ago

We opt for generating human videos because we find that the current best video models (e.g. VideoPoet) are already good at generating human videos *zero-shot* given an image of a scene and a language description of a task. This doesn't require any fine-tuning or adaptation! 2/n

Homanga Bharadhwaj · 1 year ago

The video model generalizes well to new scenarios by virtue of web-scale training. The policy also generalizes to tasks beyond those in the robot data, since it has the much simpler job of translating the generated video into actions by following motion cues from the video. 3/n

Homanga Bharadhwaj · 1 year ago

We can also chain Gen2Act for long-horizon activities with multiple tasks by sequentially rolling out video generation and policy execution conditioned on the generated video. 4/n
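A minimal sketch of this chaining loop, assuming hypothetical callables for the camera, the video model, and the policy (`get_observation`, `generate_human_video`, and `execute_policy` are illustrative names, not the authors' API). Stopping at the first failed task is also an assumption made here for simplicity; the thread does not say how failures are handled.

```python
from typing import Any, Callable, List, Sequence

Frame = Any          # placeholder type for one image observation
Video = List[Frame]  # placeholder type for a generated human video


def run_gen2act_chain(
    instructions: Sequence[str],
    get_observation: Callable[[], Frame],
    generate_human_video: Callable[[Frame, str], Video],
    execute_policy: Callable[[Video], bool],
) -> List[bool]:
    """Roll out video generation and policy execution one task at a time.

    All three callables are hypothetical stand-ins:
      get_observation()            -> current image of the scene
      generate_human_video(img, s) -> human video for instruction s
      execute_policy(video)        -> True if the closed-loop rollout succeeded
    """
    results: List[bool] = []
    for instruction in instructions:
        # Re-observe the scene so each video starts from the state
        # left behind by the previous task.
        scene = get_observation()
        video = generate_human_video(scene, instruction)
        success = execute_policy(video)
        results.append(success)
        if not success:
            # Assumption: abort the chain once a task fails.
            break
    return results
```

The key point is that the scene image is captured again before each generation step, so every generated video is conditioned on the outcome of the previous task rather than on the initial scene.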

Homanga Bharadhwaj · 1 year ago

Following prior work, we categorize results by level of generalization. Gen2Act achieves non-trivial success rates (30-60%) even for the challenging categories of motion-type and object-type generalization. 5/n

Homanga Bharadhwaj · 1 year ago

This was a fun project w/ @debidatta @gupta_abhinav_ @shubhtuls @CarlDoersch @shahdhruv_ @xiao_ted @SeanKirmani @xf1280 @DorsaSadigh @GoogleDeepMind @CMU_Robotics @StanfordAILab More details: Video: n/n

Samarth Sinha · 1 year ago

Congrats Homanga!!

Jason Ma · 1 year ago

Excited to see this out, congrats Homanga!

Paweł Budzianowski · 1 year ago

Great to see the first video-based model employed! This opens up a completely new category of possibilities!

Jay Vakil · 1 year ago

Amazing work @mangahomanga

Rui Chen · 1 year ago

Great work! Human video is a useful and unlimited source for manipulation.
