Gen2Act: Casting language-conditioned manipulation as human video generation followed by closed-loop policy execution conditioned on the generated video enables solving diverse real-world tasks unseen in the robot dataset! 1/n

Homanga Bharadhwaj

3,054 subscribers

71,154 views • 1 year ago • via X (Twitter)

Education, Science & Technology

10 comments

Homanga Bharadhwaj • 1 year ago

We opt for generating human videos because we find that current best video models (e.g. VideoPoet) are already good at generating human videos *zero-shot* given an image of a scene and a language description of a task. This doesn't require any fine-tuning/adaption! 2/n

Homanga Bharadhwaj • 1 year ago

The video model generalizes well to new scenarios by virtue of web-scale training. The policy also generalizes to tasks beyond those in the robot data, as it is tasked with the much simpler job of translating the generated video to actions by following motion cues from the video. 3/n

Homanga Bharadhwaj • 1 year ago

We can also chain Gen2Act for long-horizon activities with multiple tasks by sequentially rolling out video generation and policy execution conditioned on the generated video. 4/n
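The chaining described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyEnv`, `generate_human_video`, `policy_step`, and `run_chain` are hypothetical names, and the scene is reduced to a single integer so the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a robot scene: the state is one integer."""
    state: int = 0
    log: List[str] = field(default_factory=list)

    def observe(self) -> int:
        return self.state

    def apply(self, action: int) -> None:
        self.state += action


def generate_human_video(image: int, instruction: str, horizon: int = 3) -> List[int]:
    """Placeholder for a zero-shot video model (assumption): returns `horizon`
    target 'frames' that start from the current observation."""
    return [image + t + 1 for t in range(horizon)]


def policy_step(observation: int, video: List[int], t: int) -> int:
    """Placeholder closed-loop policy (assumption): choose the action that
    moves the current observation toward frame t of the generated video."""
    return video[t] - observation


def run_chain(env: ToyEnv, instructions: List[str], horizon: int = 3) -> ToyEnv:
    for instruction in instructions:
        # Generate a fresh video from the scene left behind by the previous task.
        video = generate_human_video(env.observe(), instruction, horizon)
        for t in range(horizon):
            # Closed loop: re-observe the scene before every action.
            action = policy_step(env.observe(), video, t)
            env.apply(action)
        env.log.append(instruction)
    return env


env = run_chain(
    ToyEnv(),
    ["open the drawer", "place the cup inside", "close the drawer"],
)
print(env.state, env.log)
```

The point of the sketch is the ordering: each task's video is generated only after the previous task has finished executing, so it is conditioned on the scene as it actually is rather than on a plan made up front.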

Homanga Bharadhwaj • 1 year ago

Following prior works, we categorize results with respect to different levels of generalization. Gen2Act achieves non-trivial success rates (30-60%) for even the challenging categories of motion-type and object-type generalization 5/n

Homanga Bharadhwaj • 1 year ago

This was a fun project w/ @debidatta @gupta_abhinav_ @shubhtuls @CarlDoersch @shahdhruv_ @xiao_ted @SeanKirmani @xf1280 @DorsaSadigh @GoogleDeepMind @CMU_Robotics @StanfordAILab More details: Video: n/n

Samarth Sinha • 1 year ago

Congrats Homanga!!

Jason Ma • 1 year ago

Excited to see this out, congrats Homanga!

Paweł Budzianowski • 1 year ago

Great to see the first video-based model employed! This opens up a completely new category of possibilities!

Jay Vakil • 1 year ago

Amazing work @mangahomanga

Rui Chen • 1 year ago

Great work! Human video is a useful and unlimited source for manipulation.

Related videos

What representation enables open-world robot manipulation from generated videos? Introducing Dream2Flow, our recent work that bridges video generation and robot control with 3D object flow. Stanford University #ICRA2026 1/N

Wenlong Huang

106,064 views • 4 months ago

Vision-language models perform diverse tasks via in-context learning. Time for robots to do the same! Introducing In-Context Robot Transformer (ICRT): a robot policy that learns new tasks by prompting with robot trajectories, without any fine-tuning. [1/N]

Max Fu

40,451 views • 1 year ago

World Model meets robot policy! Robbyant's LingBot-VA: unifies video world modeling and robotic policy learning. - A single model generates both future video and the actions to make it real. - Long-term memory enables long-horizon tasks. - Claims significant outperformance over π₀.₅ in real-world tasks. - It's open-source

The Humanoid Hub

17,721 views • 5 months ago

Long-horizon visual goals remain surprisingly hard for robot manipulation. We introduce Act2Goal, a goal-conditioned policy that uses a visual world model to reason about progress toward a goal, and practice it autonomously in the real world.

Jianlan Luo

95,100 views • 6 months ago

Imitation learning has a data scarcity problem. Introducing EgoDex from Apple, the largest and most diverse dataset of dexterous human manipulation to date — 829 hours of egocentric video + paired 3D hand poses across 194 tasks. Now on arxiv: (1/4)

Ryan Hoque

114,164 views • 1 year ago

Excited to announce Tau Robotics. We are building a general AI for robots. We start by building millions of robot arms that learn in the real world. In the video, two robot arms are fully autonomous and controlled by a single neural network conditioned on different language instructions (four axes and five axes robot arms). The other two arms are teleoperated. The entire hardware cost in the video is about $1400. The video is at 1.5x speed.

Alexander Koch

437,856 views • 2 years ago

Can we simplify video generation by decomposing it into interleaved text-video co-generation? Would explicit, repeated thinking in language improve generation in pixels? We introduce TV2TV: a unified model that jointly learns - language modeling (next-token prediction) - video flow matching (next-frame prediction) At inference, TV2TV dynamically alternates between textual thinking and video generation. Model generations below: interleaved text plans and video slices (~1–2s) are co-generated over time, conditioned on a single frame per sport. 📖

Xiaochuang Han

31,749 views • 7 months ago

To bring generalist intelligent robots to the real world, we have to overcome the data scarcity problem. At Rhoda, we are solving it by reformulating robot policies as video generation. Today, we introduce the Direct Video-Action Model (DVA)

Rhoda AI

70,219 views • 4 months ago

Perceptive Humanoid Parkour (PHP) introduces a modular framework that enables the Unitree G1 humanoid to perform long-horizon, vision-based parkour. - It chains retargeted human motion clips into diverse, long-horizon kinematic reference trajectories. - RL expert policies learn individual skills that are distilled into a depth-conditioned student policy. - The robot autonomously selects the appropriate skill based on the obstacle geometry.

The Humanoid Hub

61,055 views • 5 months ago

Is VideoGen starting to become good enough for robotic manipulation? 🤖 Check out our recent work, RIGVid — Robots Imitating Generated Videos — where we use AI-generated videos as intermediate representations and 6-DoF motion retargeting to guide robots in diverse manipulation tasks: pouring, wiping, mixing, and more. 🔗 Key takeaways: - VideoGen starts to become good enough for robotics - As the field progresses, we are expecting much better results in the coming years - Depending on whether video prediction models take actions or not (VideoGen vs Action-Conditioned Video Prediction), there are different ways to use them. - Controllability & steerability are still issues In the paper, we explore: – How do different video generation models compare for robotic imitation? – Can generated videos replace real videos for imitation? – What causes failure of imitation given high-quality videos? – How does imitating from video compare with other representations (e.g., keypoint constraints like ReKep)? 🎥 Watch the video for (1) AI-generated inputs, (2) robot executions, and (3) the 3D intermediate representation bridging the embodiment gap.

Yunzhu Li

16,540 views • 1 year ago

TidyBot: Personalized Robot Assistance with Large Language Models. The approach enables fast adaptation and achieves 91.2% accuracy on unseen objects in our benchmark dataset. We also demonstrate our approach on a real-world mobile manipulator called TidyBot, which successfully puts away 85.0% of objects in real-world test scenarios. abs: project page: paper page:

AK

326,009 views • 3 years ago

🎤🎤 Excited to introduce COME-robot🤖🤖, Closed-Loop Open-Vocabulary Mobile Manipulation with GPT-4V. It is the first closed-loop framework utilizing the vision-language foundation model for open-ended reasoning and adaptive planning in real-world scenarios. COME-robot demonstrates a significant improvement in task success rate (~25%) compared to SOTA methods. Project: Arxiv:

Siyuan Huang

22,291 views • 2 years ago

Introducing Yell At Your Robot (YAY Robot!) 🗣️- a fun collaboration b/w Stanford University and UC Berkeley 🤖 We enable robots to improve on-the-fly from language corrections: robots rapidly adapt in real-time and continuously improve from human verbal feedback. YAY Robot enables long-horizon, dexterous manipulation tasks like preparing trail-mix, packing a ziploc bag, and cleaning dishes:

Lucy Shi

122,827 views • 2 years ago

Can a single neural network policy generalize over poses, objects, obstacles, backgrounds, scene arrangements, in-hand objects, and start/goal states? Introducing Neural MP: A generalist policy for solving motion planning tasks in the real world 🤖 1/N

Murtaza Dalal

114,538 views • 1 year ago

We release Cosmos Policy 💫: a state-of-the-art robot policy built on a video diffusion model backbone. - policy + world model + value function — in 1 model - no architectural changes to the base video model - SOTA in LIBERO (98.5%), RoboCasa (67.1%), & ALOHA tasks (93.6%) 🧵👇

Moo Jin Kim

149,398 views • 5 months ago

🤖 What if a robot could perform a new task just from a natural language command, with zero demonstrations? Our new work, NovaFlow, makes it possible! We use a pre-trained video generative model to create a video of the task, then translate it into a plan for real-world robot execution. 1/6 #Robotics #AI #ZeroShot #Manipulation

Hongyu Li

105,571 views • 9 months ago

Diffusion has shown great promise for generating robot actions, can it act as a world model to generate the future conditioned on actions? In our work led by han qi Haocheng Yin and in collaboration with Yilun Du, we show a controllable action-conditioned video diffusion model can produce photorealistic and (near) physics-accurate future predictions. This ability strengthens the policy via: - ranking different action proposals and selecting the best, or - visual trajectory optimization by optimizing the action proposals using gradient ascent. Learn more about Generative Predictive Control (GPC) at:

Heng Yang

38,428 views • 1 year ago

Qwen-VLA feels like one of the first real robotics foundation models. A single system trained across robot manipulation, navigation, egocentric human video, simulation, and vision-language reasoning instead of isolated robot policies.

Robots Digest 🤖

14,663 views • 1 month ago

DisCo: Disentangled Control for Referring Human Dance Generation in Real World paper page: Generative AI has made significant strides in computer vision, particularly in image/video synthesis conditioned on text descriptions. Despite the advancements, it remains challenging especially in the generation of human-centric content such as dance synthesis. Existing dance synthesis methods struggle with the gap between synthesized content and real-world dance scenarios. In this paper, we define a new problem setting: Referring Human Dance Generation, which focuses on real-world dance scenarios with three important properties: (i) Faithfulness: the synthesis should retain the appearance of both human subject foreground and background from the reference image, and precisely follow the target pose; (ii) Generalizability: the model should generalize to unseen human subjects, backgrounds, and poses; (iii) Compositionality: it should allow for composition of seen/unseen subjects, backgrounds, and poses from different sources. To address these challenges, we introduce a novel approach, DISCO, which includes a novel model architecture with disentangled control to improve the faithfulness and compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions.

AK

161,453 views • 3 years ago

Boston Dynamics has released a new video of their next-generation humanoid robot, Atlas. "We are excited to share some of our progress on developing Large Behavior Models for Atlas. This work is part of a collaboration between AI research teams at Toyota Research Institute (TRI) and Boston Dynamics. We have been building end-to-end language-conditioned policies that enable Atlas to accomplish long-horizon manipulation tasks."

Sawyer Merritt

200,744 views • 11 months ago