
Xiaolong Wang
@xiaolonw • 21,558 subscribers
Research Director, @Meta Superintelligence Labs • Co-founder of ARI • Associate Professor @UCSDJacobs • Postdoc @berkeley_ai • PhD @CMU_Robotics

The code for GSPN #CVPR2025 is released! We proposed a new attention mechanism with sqrt(N) complexity, which enables efficient high-resolution image generation. We can generate 8K images with a 42x speedup compared to self-attention in StableDiffusionXL! Code: Paper:
Xiaolong Wang • 354,791 views • 1 year ago

Test-Time Training (TTT) is now on Video! And not just a 5-second video. We can generate a full 1-min video! The TTT module is an RNN module that provides an explicit and efficient memory mechanism. It models the hidden state of an RNN with a machine learning model, which is updated via gradient descent. Combined with a Diffusion Transformer, we are able to generate a 1-min Tom and Jerry cartoon. Enjoy our video, generated from this previously unseen input script: Jerry happily eats cheese in a tidy kitchen until Tom playfully takes it away, teasing him. Annoyed, Jerry packs his belongings and leaves home, dragging a small suitcase behind him. Later, Tom notices Jerry's absence, feels sad, and follows Jerry's tiny footprints all the way to San Francisco. Jerry sits disheartened in an alleyway, where Tom finds him, gently offering cheese as an apology. Jerry forgives Tom, accepts the cheese, and the two return home together, their friendship restored.
Xiaolong Wang • 186,332 views • 1 year ago
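
To make the "hidden state is a model updated by gradient descent" idea concrete, here is a minimal sketch in plain Python. It is not the released TTT code: the inner model is assumed to be a single linear map W, the self-supervised loss is assumed to be plain reconstruction of each token, and the names `ttt_linear_layer` and `lr` are hypothetical.

```python
def ttt_linear_layer(tokens, lr=0.1):
    """Toy TTT-style layer: the hidden state is the weight matrix W of a linear model.

    Assumptions (not from the original post): the inner model is linear and the
    self-supervised loss is 0.5 * ||W x - x||^2 for each incoming token x.
    """
    d = len(tokens[0])
    W = [[0.0] * d for _ in range(d)]  # hidden state, starts at zero
    outputs = []
    for x in tokens:
        # Prediction and residual of the inner model on the current token.
        pred = [sum(W[i][j] * x[j] for j in range(d)) for i in range(d)]
        residual = [pred[i] - x[i] for i in range(d)]
        # One gradient-descent step: dLoss/dW = outer(residual, x).
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * residual[i] * x[j]
        # The layer output is read out with the updated state.
        outputs.append([sum(W[i][j] * x[j] for j in range(d)) for i in range(d)])
    return outputs, W


if __name__ == "__main__":
    sequence = [[1.0, 0.0]] * 50
    outs, _ = ttt_linear_layer(sequence)
    print(outs[0], outs[-1])
```

Feeding the same token repeatedly shows the memory effect: each update moves W closer to reproducing that token, so later outputs reconstruct it better than the first one.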

Let’s think about humanoid robots outside carrying the box. How about having the humanoid come out the door, interact with humans, and even dance? Introducing Expressive Whole-Body Control for Humanoid Robots: See how our robot performs rich, diverse, and expressive motions in the real world 👇🧵
Xiaolong Wang • 309,009 views • 2 years ago

This work is not about a new technique. GMT (General Motion Tracking) shows that, with good engineering practices, you can actually train a single unified whole-body control policy for all agile motions, and it works in the real world, directly with sim2real and without adaptation. This is different from many existing works that train/adapt one individual policy for each specific motion. Instead of just showing beautiful motions/dances, this is actually practical, because now you can convert all types of commands into real actions in real time without tuning anything. Visit to check out how we handle the following challenges: ✅ Partial Observability. ✅ Hardware Limitations. ✅ Unbalanced Data Distribution. ✅ Model Expressiveness. Work led by Zixuan Chen and Mazeyu Ji
Xiaolong Wang • 83,261 views • 11 months ago

I have been cleaning up my daughter's mess for more than two years now. Last weekend our robot came home to do the job for me. 🤖 Our new work on visual whole-body control learns a policy to coordinate the robot's legs and arms for mobile manipulation. See how the legs bend autonomously, based on visual inputs, when grasping objects on the ground. (It is NOT teleoperation!) We again adopt a Sim2Real approach and our policy generalizes to beaches, forests, and streets in San Diego. 👇🧵
Xiaolong Wang • 159,145 views • 2 years ago

Is 3D scene generation much closer to being solved all of a sudden? It has been a few days since the release of OpenAI Sora. We ran our COLMAP-Free 3D Gaussian Splatting on the released videos. Our method does not need pre-processed camera poses, and it seems we can get 3D directly from the videos. Check out our results here. 🧵👇 (1/n)
Xiaolong Wang • 157,348 views • 2 years ago

Tesla Optimus can arrange batteries in their factories, ours can do skincare (on Yuzhe Qin)! We open-source Bunny-VisionPro, a teleoperation system for bimanual hand manipulation. Users can control the robot hands in real time using VisionPro, flexible like a bunny. 🐇 We also have kitchen tasks, playing with a Rubik's Cube, and dynamic motion tasks. Imitation learning policies are trained on sweeping with a broom, serving a drink, and wiping glasses. Check our website for more details: The project is led by Runyu Ding, Yuzhe Qin, and Jiyue Zhu
Xiaolong Wang • 90,735 views • 2 years ago

3D Gaussian Splatting is great, but can it work without pre-computed camera poses? Introducing: COLMAP-Free 3D Gaussian Splatting. Our recent work shows that not only can it, but 3D Gaussians also make camera pose estimation easy (compared to NeRF) alongside reconstruction. 👇🧵
Xiaolong Wang • 76,747 views • 2 years ago

Mobile-TeleVision follows our previous idea of separating upper-body and lower-body control: 1⃣ We train RL for lower-body locomotion and standing; 2⃣ We use IK and retargeting directly for the upper body, for high-precision and smooth manipulation. We avoid using RL for upper-body motion tracking. This is because it is fundamentally hard to avoid errors with RL tracking, and the sim2real gap adds further error. At this point, we still find that using IK is more practical and precise.
Xiaolong Wang • 32,662 views • 1 year ago
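
For readers unfamiliar with IK, the sketch below solves inverse kinematics for a toy two-link planar arm with damped least squares. It only illustrates the general technique; it is not the Mobile-TeleVision solver. The link lengths, the damping value, and the function names are all assumptions made for this example.

```python
import math

L1, L2 = 0.3, 0.25  # hypothetical link lengths in meters


def forward_kinematics(q):
    """End-effector (x, y) of a 2-link planar arm with joint angles q = [q1, q2]."""
    a, b = q[0], q[0] + q[1]
    return (L1 * math.cos(a) + L2 * math.cos(b),
            L1 * math.sin(a) + L2 * math.sin(b))


def solve_ik(target, q0, damping=0.01, iters=200, tol=1e-6):
    """Damped least-squares IK: dq = J^T (J J^T + damping^2 I)^-1 * error."""
    q = list(q0)
    for _ in range(iters):
        x, y = forward_kinematics(q)
        ex, ey = target[0] - x, target[1] - y
        if math.hypot(ex, ey) < tol:
            break
        a, b = q[0], q[0] + q[1]
        # Jacobian of the end-effector position with respect to the joints.
        j11 = -L1 * math.sin(a) - L2 * math.sin(b)
        j12 = -L2 * math.sin(b)
        j21 = L1 * math.cos(a) + L2 * math.cos(b)
        j22 = L2 * math.cos(b)
        # A = J J^T + damping^2 I (symmetric positive definite 2x2 matrix).
        a11 = j11 * j11 + j12 * j12 + damping ** 2
        a12 = j11 * j21 + j12 * j22
        a22 = j21 * j21 + j22 * j22 + damping ** 2
        det = a11 * a22 - a12 * a12
        # v = A^-1 * error, then dq = J^T v.
        vx = (a22 * ex - a12 * ey) / det
        vy = (-a12 * ex + a11 * ey) / det
        q[0] += j11 * vx + j21 * vy
        q[1] += j12 * vx + j22 * vy
    return q


if __name__ == "__main__":
    q = solve_ik(target=(0.35, 0.2), q0=(0.3, 0.5))
    print(q, forward_kinematics(q))
```

The damping term keeps the step bounded near singular arm configurations, which is one common way to get the smooth motion that a purely learned tracker may struggle to guarantee.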

We have seen a lot of legged robots doing navigation in the wild. But how about mobile manipulation in the wild? I have been pushing the direction of learning a unified, efficient, and dynamic 3D representation of scenes (for navigation) and objects (for manipulation) for the past two years. And now we have GeFF --- our large-scale, generalizable feature field, that combines the speed of a feed-forward neural network with the rich semantics from Foundation Models, to handle dynamically changing scenes, and enable open-ended, language-grounded scene and object understanding.
Xiaolong Wang • 42,767 views • 2 years ago

Introducing Open-TeleVision, with a fully autonomous policy video 👇. We can complete a long-horizon task, inserting 12 cans nonstop without any interruption. We offer: 🤖 Highly precise and smooth bimanual manipulation. 📺 Active egocentric vision feedback (with a moving neck). It is achieved by imitation learning from teleoperation: We propose a VR-based REAL-TIME teleoperation system that streams the stereo video observation from the robot camera to the VR device. The robot neck moves as the human head moves, and the robot hands move as the human hands move, offering the operator an intuitive experience, as if the human herself becomes the robot. The devil is in the details, and in how to implement things right: ✅ How to perform IK/retargeting for smooth and precise control. ✅ How to do all this and also stream stereo video with no latency, all in real time. We released our code here: Active head hardware design: 1/n
Xiaolong Wang • 25,572 views • 1 year ago

One more example of NaVILA on the G1 humanoid robot. The instruction is: "Walk forward. Step on the grass and continue going forward. Stop when you are close to the big bear statue." We only got the robot very recently, and it worked right away after plugging in NaVILA. Kudos to Xuxin Cheng and Jialong Li for their help setting up the robot.
Xiaolong Wang • 13,823 views • 1 year ago