Xiaolong Wang

@xiaolonw • 21,955 subscribers

Research Director, @Meta Superintelligence Labs • Co-founder of ARI • Associate Professor @UCSDJacobs • Postdoc @berkeley_ai • PhD @CMU_Robotics

Shorts

A behind-the-scenes video on how teleoperation is done. Whole-body manipulation with only a VisionPro.

40,309 views

Videos

The code of GSPN #CVPR2025 is released! We propose a new attention mechanism with sqrt(N) complexity, which enables efficient high-resolution image generation. We can generate 8K images with a 42x speedup compared to self-attention in StableDiffusionXL! Code: Paper:

354,887 views • 1 year ago

Test-Time Training (TTT) is now on Video! And not just a 5-second video: we can generate a full 1-min video! The TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated via gradient descent. Combined with a Diffusion Transformer, we are able to generate a 1-min Tom and Jerry cartoon. Enjoy our video with input script (not seen before): Jerry happily eats cheese in a tidy kitchen until Tom playfully takes it away, teasing him. Annoyed, Jerry packs his belongings and leaves home, dragging a small suitcase behind him. Later, Tom notices Jerry's absence, feels sad, and follows Jerry's tiny footprints all the way to San Francisco. Jerry sits disheartened in an alleyway, where Tom finds him, gently offering cheese as an apology. Jerry forgives Tom, accepts the cheese, and the two return home together, their friendship restored.

186,487 views • 1 year ago

Let’s think about humanoid robots beyond carrying boxes. How about having the humanoid come out the door, interact with humans, and even dance? Introducing Expressive Whole-Body Control for Humanoid Robots: See how our robot performs rich, diverse, and expressive motions in the real world 👇🧵

309,118 views • 2 years ago

Stable Diffusion generates beautiful images, but can it be used for open-world recognition? Try Demo! Our #CVPR2023 paper shows that the pre-trained diffusion model is indeed a good image parser that allows for open-vocabulary segmentation and detection.

241,243 views • 3 years ago

I have been cleaning up my daughter's mess for more than two years now. Last weekend our robot came home to do the job for me. 🤖 Our new work on visual whole-body control learns a policy to coordinate the robot's legs and arms for mobile manipulation. See how the legs bend when grasping objects on the ground, autonomously, based on visual inputs. (It is NOT teleoperation!) We again adopt a Sim2Real approach, and our policy generalizes to beaches, forests, and streets in San Diego. 👇🧵

159,222 views • 2 years ago

Is 3D scene generation much closer to being solved all of a sudden? It has been a few days since the release of OpenAI Sora. We ran our COLMAP-Free 3D Gaussian Splatting on the released videos. Our method does not need pre-processed cameras, and it seems we can get 3D directly from the videos. Check out our results here. 🧵👇 (1/n)

157,383 views • 2 years ago

This work is not about a new technique. GMT (General Motion Tracking) shows that, with good engineering practices, you can actually train a single unified whole-body control policy for all agile motions, and it works in the real world, directly with sim2real and without adaptation. This is different from many existing works that train/adapt one individual policy for each specific motion. Instead of just showing beautiful motions/dances, this is actually practical, because now you can convert all types of commands into real actions in real time without tuning anything. Visit to check out how we handle the following challenges: ✅ Partial Observability. ✅ Hardware Limitations. ✅ Unbalanced Data Distribution. ✅ Model Expressiveness. Work led by Zixuan Chen and Mazeyu Ji

83,261 views • 1 year ago

Tesla Optimus can arrange batteries in their factories, ours can do skincare (on Yuzhe Qin)! We open-source Bunny-VisionPro, a teleoperation system for bimanual hand manipulation. Users can control the robot hands in real time using VisionPro, flexible like a bunny. 🐇 We also have kitchen tasks, playing Rubik's Cube, and dynamic motion tasks. Imitation learning policies are trained on sweeping with a broom, serving a drink, and wiping glasses. Check our website for more details: The project is led by Runyu Ding, Yuzhe Qin, and Jiyue Zhu

90,902 views • 2 years ago

The robot climbs stairs🏯, steps over stones 🪨, and runs in the wild🏞️, all in one policy, without any remote control! Our #CVPR2023 Highlight paper achieves this by using RL + a 3D Neural Volumetric Memory (NVM) trained with view synthesis!

113,006 views • 3 years ago

3D Gaussian Splatting is great, but can it work without the pre-computed camera poses? Introducing: COLMAP-Free 3D Gaussian Splatting. Our recent work shows that not only can it, but 3D Gaussians also make camera pose estimation easy (compared to NeRF) along with reconstruction. 👇🧵

76,747 views • 2 years ago

Vision-Language Foundation models should go to 3D for robotics!🤖 CoRL23 Oral: GNFactor learns Generalizable Neural Feature Fields for language-conditioned manipulation on diverse scenes. It unifies 3D➕Stable Diffusion features using generalizable NeRFs.

56,268 views • 2 years ago

Mobile-TeleVision follows our previous idea of separating upper-body and lower-body control: 1⃣ We train RL for lower-body locomotion and standing; 2⃣ We use IK and retargeting directly for the upper body for high-precision and smooth manipulation. We avoid using RL for upper-body motion tracking. This is because it is fundamentally hard to avoid errors with RL tracking, and the sim2real gap adds another source of error. At this point, we still find that using IK is more practical and precise.

32,662 views • 1 year ago

We have seen a lot of legged robots doing navigation in the wild. But how about mobile manipulation in the wild? I have been pushing the direction of learning a unified, efficient, and dynamic 3D representation of scenes (for navigation) and objects (for manipulation) for the past two years. And now we have GeFF --- our large-scale, generalizable feature field, that combines the speed of a feed-forward neural network with the rich semantics from Foundation Models, to handle dynamically changing scenes, and enable open-ended, language-grounded scene and object understanding.

42,767 views • 2 years ago

Besides reading cool papers, my Twitter account is mostly used for catching up on Formula 1 news. Very excited about Lando Norris's great performance recently. Now we combine Formula Racing with AI research: The following video shows how we train a Reinforcement Learning policy to drive a Dallara F317 on the Monza and Barcelona circuits autonomously in simulation. You can think of it as optimizing a Qualifying lap: You want the car to run a single lap, no opponents, with the fastest speed and shortest lap time. We open-source the code to train RL in a racing simulator with Assetto Corsa, a widely deployed platform for esports. We also train with human demonstrations from multiple expert drivers, including a professional e-sports driver. Interestingly, AI is not beating the professional yet, but it is close. See our open-sourced code, simulator, and data here:

32,361 views • 2 years ago

Presenting MonoNeRF at #ICML2023! We train a generalizable NeRF from: ✅Large-scale monocular videos instead of one scene ✅No GT camera poses.📷🚫 Without per-scene optimization, the model can do view synthesis, depth estimation, and camera pose estimation.

36,578 views • 3 years ago

Introducing Open-TeleVision: with Fully Autonomous policy video👇. We can conduct a long-horizon task, inserting 12 cans nonstop without any interruptions. We offer: 🤖 Highly precise and smooth bimanual manipulation. 📺 Active egocentric vision (with a moving neck) feedback. It is achieved by imitation learning from teleoperation: We propose a VR-based REAL-TIME teleoperation system that streams the stereo video observation from the robot camera to the VR device. The robot neck moves as the human head moves, the robot hands move as the human hands move, offering the operator an intuitive experience as the human herself becomes the robot. The devil is in the details, and in how to implement things right: ✅How to perform IK/retargeting for smooth and precise control. ✅How to do all these and also stream stereo video without latency, all in real time. We released our code here: Active head hardware design: 1/n

25,572 views • 2 years ago

🏗️ Policy Adaptation from Foundation Model Feedback #CVPR2023 Instead of using a foundation model as a pre-trained encoder (generator), we use it as a Teacher (discriminator) to tell where our policy went wrong and help it adapt to new envs and tasks.

24,412 views • 3 years ago

One more example of NaVILA for the G1 humanoid robot. The instruction is: "Walk forward. Step on the grass and continue going forward. Stop when you are close to the big bear statue." We only got the robot very recently, and it worked right away once we plugged in NaVILA. Kudos to Xuxin Cheng and Jialong Li for their help setting up the robot.

13,823 views • 1 year ago
