Synchronize Dual Hands for Physics-Based Dexterous Guitar Playing discuss: We present a novel approach to synthesize dexterous motions for physically simulated hands in tasks that require coordination between the control of two hands with high temporal precision. Instead of directly learning a joint policy to control two hands, our approach performs bimanual control through cooperative learning where each hand is treated as an individual agent. The individual policies for each hand are first trained separately, and then synchronized through latent space manipulation in a centralized environment to serve as a joint policy for two-hand control. By doing so, we avoid directly performing policy learning in the joint state-action space of two hands with higher dimensions, greatly improving the overall training efficiency. We demonstrate the effectiveness of our proposed approach in the challenging guitar-playing task. The virtual guitarist trained by our approach can synthesize motions from unstructured reference data of general guitar-playing practice motions, and accurately play diverse rhythms with complex chord pressing and string picking patterns based on the input guitar tabs that do not exist in the references. Along with this paper, we provide the motion capture data that we collected as the reference for policy training.

AK

475,764 subscribers

26,855 views • 1 year ago • via X (Twitter)

Science & Technology, Education

2 Comments

Tommyedz AΩ • 1 year ago

@W4nkpire

StudioGaltMocap • 1 year ago

Looks cool, but I am not a guitar person. Does anyone know if it is accurate?

Related Videos

Physics-based Motion Retargeting from Sparse Inputs paper page: Avatars are important to create interactive and immersive experiences in virtual worlds. One challenge in animating these characters to mimic a user's motion is that commercial AR/VR products consist only of a headset and controllers, providing very limited sensor data of the user's pose. Another challenge is that an avatar might have a different skeleton structure than a human and the mapping between them is unclear. In this work we address both of these challenges. We introduce a method to retarget motions in real-time from sparse human sensor data to characters of various morphologies. Our method uses reinforcement learning to train a policy to control characters in a physics simulator. We only require human motion capture data for training, without relying on artist-generated animations for each avatar. This allows us to use large motion capture datasets to train general policies that can track unseen users from real and sparse data in real-time. We demonstrate the feasibility of our approach on three characters with different skeleton structure: a dinosaur, a mouse-like creature and a human. We show that the avatar poses often match the user surprisingly well, despite having no sensor information of the lower body available. We discuss and ablate the important components in our framework, specifically the kinematic retargeting step, the imitation, contact and action reward as well as our asymmetric actor-critic observations. We further explore the robustness of our method in a variety of settings including unbalancing, dancing and sports motions.

AK

106,527 views • 3 years ago

You might have seen the WuBOT performing at the 2026 Spring Festival Gala; however, most high-dynamic extreme motions you see are executed by overfitted tracking policies. Until now, training a unified policy capable of performing various extreme motions with a high success rate remained an unsolved challenge. We spent an entire year digging into the barrier between general tracking and extreme physical behaviors. After burning through dozens of G1 robots, we finally identified the bottleneck of learning and physical executability. With these discoveries, we developed OmniXtreme: the first general policy that can execute diverse extreme motions, including consecutive flips, extreme balancing, and even breakdancing with rapid contact switches! This capability is achieved by pre-training a flow-based generative control policy and then post-training with actuation-aware residual RL for complex physical dynamics—a step we found critical for successful real-world transfer. This work is a joint collaboration with Unitree. Together, we are pushing the physical limits of humanoid robots. It is incredibly exciting to see a general "robot gymnast" and "robot breakdancer" come to life! It was also our first time publishing a paper with XingXing, which was an enlightening experience. The model checkpoints are now released—we welcome you to play with them! 📦 📄 Paper: 🌐 Project: 💻 Code:

Siyuan Huang

107,090 views • 5 months ago

Tesla Optimus can arrange batteries in their factories, ours can do skincare (on Yuzhe Qin)! We open-source Bunny-VisionPro, a teleoperation system for bimanual hand manipulation. Users can control the robot hands in real time using VisionPro, flexible like a bunny. 🐇 We also have kitchen tasks, playing Rubik's Cube, and dynamic motion tasks. Imitation learning policies are trained on sweeping with a broom, serving a drink, and wiping glasses. Check our website for more details: The project is led by Runyu Ding, Yuzhe Qin, and Jiyue Zhu

Xiaolong Wang

90,938 views • 2 years ago

Another robot from Disney! 🕺🏻 Just look at this dancing fella created by Disney Research. Creating control policies that work on real robots and handle diverse, unseen motions in physics-based character control is still difficult. The Disney team proposes a two-stage solution. The first stage involves learning a latent space encoding with an autoencoder. After encoding, the policy is trained to map kinematic input to dynamic output, ensuring accurate and adaptable movement. By separating these stages, the method avoids common mode collapse issues and improves motion encoding. The technique has proven effective on real robots and in simulation, marking a significant leap forward in robot control. Can't wait for it in action live in 2025 :) ~~~ RT to help 1 robot find a new workplace.

Lukas Ziegler

65,955 views • 1 year ago

Agile Continuous Jumping in Discontinuous Terrains discuss: We focus on agile, continuous, and terrain-adaptive jumping of quadrupedal robots in discontinuous terrains such as stairs and stepping stones. Unlike single-step jumping, continuous jumping requires accurately executing highly dynamic motions over long horizons, which is challenging for existing approaches. To accomplish this task, we design a hierarchical learning and control framework, which consists of a learned heightmap predictor for robust terrain perception, a reinforcement-learning-based centroidal-level motion policy for versatile and terrain-adaptive planning, and a low-level model-based leg controller for accurate motion tracking. In addition, we minimize the sim-to-real gap by accurately modeling the hardware characteristics. Our framework enables a Unitree Go1 robot to perform agile and continuous jumps on human-sized stairs and sparse stepping stones, for the first time to the best of our knowledge. In particular, the robot can cross two stair steps in each jump and completes a 3.5m long, 2.8m high, 14-step staircase in 4.5 seconds. Moreover, the same policy outperforms baselines in various other parkour tasks, such as jumping over single horizontal or vertical discontinuities.
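The staircase figures in the abstract above imply a few per-jump quantities that are easy to work out. The following is a minimal sketch, assuming uniform steps and that every jump clears exactly two steps; the `StairRun` class and its field names are hypothetical and only illustrate the arithmetic, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StairRun:
    """Figures quoted in the abstract; uniform step geometry is an assumption."""
    length_m: float
    height_m: float
    steps: int
    duration_s: float
    steps_per_jump: int

    def summary(self) -> dict:
        # Number of jumps if each jump clears exactly `steps_per_jump` steps.
        jumps = self.steps // self.steps_per_jump
        return {
            "step_run_m": self.length_m / self.steps,    # horizontal depth of one step
            "step_rise_m": self.height_m / self.steps,   # vertical height of one step
            "jumps": jumps,
            "time_per_jump_s": self.duration_s / jumps,
            "avg_forward_speed_mps": self.length_m / self.duration_s,
            "avg_climb_speed_mps": self.height_m / self.duration_s,
        }


run = StairRun(length_m=3.5, height_m=2.8, steps=14, duration_s=4.5, steps_per_jump=2)
for name, value in run.summary().items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the robot makes seven jumps, each covering about 0.5 m forward and 0.4 m upward in roughly two thirds of a second.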

AK

35,794 views • 1 year ago

Scaling vision-language-action (VLA) models to high-DoF dexterous hands has long been a "holy grail" challenge due to the high-dimensional action space and data scarcity. As a wrap up of the year 2025, we are releasing GR-Dexter, a holistic hardware-model-data framework for generalist manipulation on a bimanual dexterous-hand robot. This is the first VLA system to achieve: ✅ High-DoF Control: Managing a 56-DoF bimanual system (21-DoF per hand). ✅ Long-Horizon Tasks with tool use: Vacuuming, bread serving with tongs, and table decluttering. ✅ Open-World Generalization: Robust performance with unseen objects and abstract instructions. Project page: ArXiv:

Xiao Ma

93,908 views • 7 months ago

We trained a humanoid with 22-DoF dexterous hands to assemble model cars, operate syringes, sort poker cards, fold/roll shirts, all learned primarily from 20,000+ hours of egocentric human video with no robot in the loop. Humans are the most scalable embodiment on the planet. We discovered a near-perfect log-linear scaling law (R² = 0.998) between human video volume and action prediction loss, and this loss directly predicts real-robot success rate. Humanoid robots will be the end game, because they are the practical form factor with minimal embodiment gap from humans. Call it the Bitter Lesson of robot hardware: the kinematic similarity lets us simply retarget human finger motion onto dexterous robot hand joints. No learned embeddings, no fancy transfer algorithms needed. Relative wrist motion + retargeted 22-DoF finger actions serve as a unified action space that carries through from pre-training to robot execution. Our recipe is called "EgoScale": - Pre-train GR00T N1.5 on 20K hours of human video, mid-train with only 4 hours (!) of robot play data with Sharpa hands. 54% gains over training from scratch across 5 highly dexterous tasks. - Most surprising result: a *single* teleop demo is sufficient to learn a never-before-seen task. Our recipe enables extreme data efficiency. - Although we pre-train in 22-DoF hand joint space, the policy transfers to a Unitree G1 with 7-DoF tri-finger hands. 30%+ gains over training on G1 data alone. The scalable path to robot dexterity was never more robots. It was always us. Deep dives in thread:
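For readers unfamiliar with the term, a "log-linear scaling law" of the kind mentioned above means the loss falls roughly linearly in the logarithm of the data volume. The following is a minimal sketch of how such a fit and its R² can be computed with ordinary least squares. The numbers are made up for illustration and are NOT the EgoScale data; `fit_log_linear` is a hypothetical helper name.

```python
import math


def fit_log_linear(volumes, losses):
    """Least-squares fit of loss = intercept + slope * log10(volume).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(v) for v in volumes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, losses))
    ss_tot = sum((y - mean_y) ** 2 for y in losses)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Synthetic example: hours of video vs. a made-up validation loss with small noise.
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
noise = [0.002, -0.001, 0.001, -0.002, 0.001]
losses = [1.0 - 0.1 * math.log10(h) + e for h, e in zip(hours, noise)]

slope, intercept, r2 = fit_log_linear(hours, losses)
print(f"slope per decade of data: {slope:.4f}, intercept: {intercept:.4f}, R^2: {r2:.4f}")
```

A negative slope means each tenfold increase in data lowers the loss by a fixed amount; an R² close to 1 means the straight line in log space explains almost all of the variation.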

Jim Fan

293,961 views • 5 months ago

Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation paper page: Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts.
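The multi-track timeline described above can be pictured as a grid of frames by body parts, where each prompt claims the parts it engages during its interval. The following toy sketch covers only that bookkeeping and runs no diffusion model. The body-part split, the `Interval` fields, and the rule that a later-listed interval wins on the parts it engages are all assumptions made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass

# Hypothetical body-part split; the post does not specify one.
BODY_PARTS = ("legs", "torso", "left_arm", "right_arm")


@dataclass(frozen=True)
class Interval:
    prompt: str
    start: int      # first frame, inclusive
    end: int        # last frame, exclusive
    parts: tuple    # body parts this action engages (assumed to be given)


def aggregate(intervals, num_frames, default="(unconstrained)"):
    """Decide, per frame and body part, which prompt's prediction would be used.

    In a real system each interval would contribute a denoised motion
    prediction; here the prompt string stands in for that prediction.
    """
    timeline = [{part: default for part in BODY_PARTS} for _ in range(num_frames)]
    for interval in intervals:  # assumption: later intervals override earlier ones
        for frame in range(max(interval.start, 0), min(interval.end, num_frames)):
            for part in interval.parts:
                timeline[frame][part] = interval.prompt
    return timeline


tracks = [
    Interval("walk forward", start=0, end=6, parts=BODY_PARTS),
    Interval("wave", start=3, end=8, parts=("right_arm",)),
]
for frame, assignment in enumerate(aggregate(tracks, num_frames=8)):
    print(frame, assignment)
```

In the overlapping frames the right arm follows "wave" while the rest of the body keeps following "walk forward", which is the kind of composition the timeline interface is meant to express.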

AK

126,585 views • 2 years ago

📢📢📢 Excited to release ManipTrans: Efficient Dexterous Bimanual Manipulation Transfer via Residual Learning (CVPR25). 🤏🤙✌️With ManipTrans, we can transfer dexterous manipulation skills into robotic hands in simulation and deploy them on a real robot, using a residual policy learned for dex manipulation. 🤖🤖🤖The video below illustrates how the MoCap data can be transferred to Inspire, Shadow, Xhand, Allegro, and Mano. With ManipTrans, we can scale up dex manip data greatly with minimal effort. For more details, please check our -webpage: -code: -huggingface:

Siyuan Huang

20,918 views • 1 year ago

Check out our #ICRA2024 paper "Actor-Critic Model Predictive Control." Model-free #reinforcementlearning (RL) is known for its strong task performance and flexibility in optimizing general reward formulations. On the other hand, #ModelPredictiveControl (MPC) benefits from robustness and online replanning capabilities. We combine both approaches by introducing a new framework called Actor-Critic Model Predictive Control. The key idea is to embed a differentiable MPC within an Actor-Critic RL framework. The proposed approach leverages the short-term predictive optimization capabilities of MPC with the exploratory and end-to-end training properties of RL. The resulting policy effectively manages both short-term decisions through the MPC-based actor and long-term prediction via the critic network, unifying the benefits of both model-based control and end-to-end learning. We validate our method in simulation and the real world with a quadcopter across various high-level tasks. We show that the proposed architecture can achieve real-time control performance, learn complex behaviors via trial and error, and retain the predictive properties of the MPC to better handle out-of-distribution behavior. Paper: Full Video with more details: Kudos to Ángel Romero, Yunlong Song IEEE ICRA University of Zurich UZH Science UZH Space Hub Aerial Core AUTOASSESS European Research Council (ERC)

Davide Scaramuzza

34,889 views • 2 years ago

Learning Physically Simulated Tennis Skills from Broadcast Videos: the authors demonstrate that the system produces controllers for physically simulated tennis players that can hit the incoming ball to target positions accurately using a diverse array of strokes (serves, forehands, and backhands), spins (topspins and slices), and playing styles (one/two-handed backhands, left/right-handed play). Overall, the system can synthesize two physically simulated characters playing extended tennis rallies with simulated racket and ball dynamics. paper: project page:

AK

125,502 views • 3 years ago

Can an inexpensive, off-the-shelf IMU be the only sensor to estimate the full state (position, velocity, orientation) of a quadrotor flying through a track at high speed and even be on par with vision-based localization? The answer is yes, within certain limitations! In this #RAL2023 paper, we propose a learning-based odometry algorithm that couples a model-based filter driven by the inertial measurements with a learning-based module with access to the control commands. Our system outperforms by a large margin the state-of-the-art visual-inertial odometry (#VIO) algorithms and the state-of-the-art learned-inertial odometry algorithm, #TLIO, for the task of drone racing. Additionally, we show that our system is as accurate as a VIO algorithm that uses a camera to localize to a known map of the racing track. The main limitation of our approach is that it cannot generalize to trajectories that have not been seen at training time. However, in drone racing competitions, the track is known beforehand. Human pilots spend hours or even days of practice on the race track before the competition. Similarly, our system can be trained with the data collected during practice time and deployed during the competition. Future work will investigate how to generalize to trajectories not seen at training time. The code is released! Paper: Video: Code: Kudos to Giovanni Cioffi Leonard Bauersfeld Elia Kaufmann European Research Council (ERC) University of Zurich UZH Science UZH Space Hub NCCR Robotics Aerial Core #RAL2023 #IROS2023 #SLAM

Davide Scaramuzza

37,061 views • 2 years ago

This is the Gaza we want to return to — and rebuild with our own hands. We do not want a polished Gaza built by those who destroyed it. We do not want a Gaza treated as land without people, or as an “investment project” for those whose hands are stained with the blood of our families. We do not want deception, false promises, or charity wrapped in power. We want Gaza in its simplicity — the Gaza we come from, and the Gaza that comes from us.

Moayed Harazen 🇵🇸

306,541 views • 6 months ago

🔥 #ICRA2026 Best Paper Finalist The era of "robot VLA = single-arm gripper" is ending. Introducing Dexora — the first open-source Vision-Language-Action system for dual-arm, dual-hand, 36-DoF dexterous manipulation. 🦾 Dual Arms 🖐️ Dual Hands 🎯 36 DoF Control 🌍 Open Source Trained on: • 100K simulated trajectories • 10K real-world demonstrations Dexora achieves: ✓ 90%+ success on basic manipulation ✓ Strong dexterous manipulation performance ✓ Cross-embodiment generalization Our key hypothesis: Train on the hardest embodiment. Transfer to simpler robots later. Instead of scaling up gripper policies, we train directly in the most expressive action space and project downward to simpler embodiments. This may be a practical path toward universal robot controllers. 🎥 Demos: 📄 Paper:

Hao Zhao

17,048 views • 1 month ago

Milestone! We (robotic arms for gadget assembly) finished our first commercial order, which brought in our first revenue. Here are some learnings from it.

The customer was a smart toy manufacturer. The task was to add a heatsink to a Raspberry Pi. We received parts from them and returned the assembled modules. Currently, it's done by teleoperation. Later it will be done by a remote employee via the Internet. Then it will be automated action by action, reducing the operator's time and making the task profitable.

P.S. If you have an assembly task that we can do for you asynchronously, leave a comment below.

Learning 1. It's possible!
This task, which is usually done by a human hand with five fingers, can be done with a two-finger gripper plus a couple of simple tools. The task was not simplified. We peeled thin films off stickers, unpacked paper boxes, moved PCBs full of components, etc. No unsolvable problems have been encountered yet.

Challenges:
1) The paper box shifted while being opened. Solved with plastic walls that the box can lean against.
2) The heat pad stuck to the gripper instead of the heatsink. This can be solved with a gripper that has a pump, but this time it was solved with the patience of the operator.
3) The film on the pad is very thin. It turned out that sub-millimeter arm precision is enough to peel it off with just a regular gripper.
4) The working area does not have enough space. You only find this out by doing real tasks in bulk. This could be solved with an extra pair of long arms, but in this case it was solved with the patience of the operator.

I think that in the end we will have 5-10 types of universal tooling and 5-10 types of grippers to solve almost all the problems in such assembly tasks.

Learning 2. It's slow.
It took 5 times longer than doing it with human hands. The good news is that there's a lot of room for improvement. We now have specific "time for task" metrics, which we will decrease with iterations.

The main reasons for the slowness:
1) To rotate the gripper to a steep angle, you are forced to control one robot arm with two hands instead of using both arms. We can fix this by simply making more room for rotations.
2) Grabbing a PCB with two arms is hard. A slight difference in rotation can break the board, and it's hard to control these angles visually. The best way to solve this is force feedback, so you can feel the pressure applied to the item.
3) Accuracy and steadiness can still be improved. We will try a metal version and double the motors to do this.
4) It is physically difficult for human hands to move with such precision. To solve this, we will add a pad for the hands, as in surgical robots.

Learning 3. It's a good business model.
The "Factory in the cloud" is a good business model for this stage. You send us parts and we send back assembled modules. Currently, this is more convenient than sending a robot to your place, as we can iterate on and fix the robot quickly and utilize it 100% of the time. Once we have polished the setup over time, we can send robots to your place.

So if we can assemble something for you in the USA at Chinese prices by using modern automation, leave a comment below.

Igor Kulakov

37,266 views • 1 year ago

Make Pixels Dance: High-Dynamic Video Generation paper page: Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately, current state-of-the-art video generation methods, primarily focusing on text-to-video generation, tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper, we introduce PixelDance, a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions, setting a new standard for video generation.

AK

102,444 views • 2 years ago

#WATCH | Chandigarh: On the India vs Pakistan match today in the Asia Cup 2025, Congress MP Manish Tewari says, "There has to be consistency of policy. If we are not engaging with Pakistan because it continues to be a State Sponsor of terror, then we should not play Cricket with them also. On one side, you are saying that credible information exists that money from the IMF is being diverted to ostensibly reconstruct HQ of the LeT. LeT is designated terrorist organisation, responsible for the attack on J&K Vidhan Sabha in 2000, attack on the Indian Parliament in 2001, attack in Mumbai in 2008, it is responsible for the Uri attack, it is responsible for the Pulwama attack, it is involved in Pahalgam attack. So, if the Lashkar's HQ are being reconstructed with developmental assistance provided by the IMF and on the other hand, we play Cricket with Pakistan, this means that we have no consistency of policy. So, there has to be one policy. If the policy is that talks and terror cannot go hand in hand, then the policy obviously should extend to that Cricket and terror cannot go hand in hand."

ANI

15,972 views • 10 months ago

Stephen Miller: "It's certainly our view as an administration that it is untenable for individual district court judges to try to assert control over the functioning of the entire executive branch. This is not found anywhere in the Constitution. It's inconsistent with any notion of democracy, and it's inconsistent with the notion that we are a country that settles big policy disputes by holding national elections. We're going to vigorously protect the prerogative of the President to change federal policy and, in most of these cases, to assert control over the federal bureaucracy. The central issue here is whether policy decisions are made by the President or by unelected career bureaucrats."

KanekoaTheGreat

70,275 views • 1 year ago

Experiments in progress. The one on the right has been learning for ~3 hours, the one in the middle for ~1 hour, and the one on the left just started a few minutes ago. The initial motivation for making the physical Atari was just to commit ourselves to a subset of algorithms that can make progress in this setup. This commitment rules out algorithms that require billions of samples to learn (or worse, require multiple environments running in parallel). Atari games are simple enough that we should be able to show learning on them in a short amount of time with no prior knowledge. Since then, I've realized that this setup is also a good way to compare different paradigms in robotics in a principled way. These paradigms are sim2real, learning from tele-operated data, and learning directly on the robots. So far, I have observed that getting sim2real to work reliably is hard. It requires tweaks that don't scale. Policies that can play perfectly in simulation fall apart because of latencies and the messiness of the real world. These aspects could be modeled to improve the simulation, but not without sinking significant human engineering hours. I have higher hopes for learning from tele-operated data, but that requires a human to learn the task first. These experiments are on my to-do list. I have to learn to play some of the games well through the robot. I’m half-decent at playing Pong and Ms Pacman now. Learning directly on robots is looking like the most promising approach. This approach takes away pesky distribution shifts and makes it possible to have algorithms that continually improve with more data and time without any human intervention. It feels great to let experiments run overnight and wake up to find improved policies. With learning on robots, I should, in principle, be able to go on a long vacation and come back to find better policies for complex tasks beyond Atari games. Whether that is possible with current learning algorithms is a different question.

Khurram Javed

52,110 views • 7 months ago

Tencent announces AppAgent: Multimodal Agents as Smartphone Users paper page: Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework enables the agent to operate smartphone applications through a simplified action space, mimicking human-like interactions such as tapping and swiping. This novel approach bypasses the need for system back-end access, thereby broadening its applicability across diverse apps. Central to our agent's functionality is its innovative learning method. The agent learns to navigate and use new apps either through autonomous exploration or by observing human demonstrations. This process generates a knowledge base that the agent refers to for executing complex tasks across different applications. To demonstrate the practicality of our agent, we conducted extensive testing over 50 tasks in 10 different applications, including social media, email, maps, shopping, and sophisticated image editing tools. The results affirm our agent's proficiency in handling a diverse array of high-level tasks.

AK

343,919 views • 2 years ago