Next step in dynamic dexterous grasping from NVIDIA: DextrAH-RGB! No more depth. We’re now consuming RGB stereo pairs, and the resulting perceptual system is much more robust. Trained entirely in sim (IsaacLab), leveraging fast tiled rendering, and deployed zero-shot to real.

Depth causes problems. In our earlier work (DextrAH-G, which consumed only depth w/o RGB), we had to cut out the background, block windows, and deal with eroded depth readings from object surface properties. This time around, none of that was a problem. Sunshine? No problem. Weird background? No problem. It all just worked.

We use a SOTA transformer architecture with ResNet-18 encoders (pretrained, then finetuned) for multi-camera image processing. This network is substantially larger than the simple convnet architecture we used earlier for depth-only processing (DextrAH-G).

Here’s what the simulated camera feeds look like during distillation (decked out with all our domain randomizations). The second video shows the stereo RGB feed from the real-world deployment (plus some cool reactivity!).

But the overall training, distillation, and deployment pipeline is essentially the same, with some minor tweaks that end up speeding up cycle time by 2x and improving robustness across object scales. Distillation now takes a little over 2 days on 4x L40S GPUs because the perceptual network is (relatively) massive and doing some significant work. We also added automatic domain randomization (ADR) to the teacher training: it starts with small domain randomization ranges and incrementally grows them over time as performance metrics improve. The teacher training takes just over 2.5 days on 8x H100s.
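To make the ADR idea concrete, here's a minimal sketch of a range that starts small and widens only when performance clears a threshold. This is not our actual implementation: the class name, the mass-scale parameter, the step size, and the 0.8 success threshold are all hypothetical, picked just to illustrate the mechanism.

```python
import random
from dataclasses import dataclass


@dataclass
class ADRRange:
    """One randomized parameter whose sampling range widens over training.

    All names and numbers here are hypothetical, not taken from DextrAH-RGB.
    """
    nominal: float            # value used when there is no randomization
    max_half_width: float     # widest the range is allowed to get
    step: float               # how much to widen per successful update
    half_width: float = 0.0   # current range; starts small (here: zero)

    def bounds(self):
        return (self.nominal - self.half_width, self.nominal + self.half_width)

    def sample(self, rng):
        # Draw one value uniformly from the current range.
        low, high = self.bounds()
        return rng.uniform(low, high)

    def update(self, success_rate, grow_above=0.8):
        """Widen the range only when the teacher is performing well enough."""
        if success_rate >= grow_above:
            self.half_width = min(self.half_width + self.step, self.max_half_width)
        return self.bounds()


# Hypothetical parameter: a scale factor on object mass.
mass_scale = ADRRange(nominal=1.0, max_half_width=0.5, step=0.25)
rng = random.Random(0)
for success_rate in (0.9, 0.5, 0.95):
    low, high = mass_scale.update(success_rate)
    print(f"success={success_rate:.2f} -> range [{low:.2f}, {high:.2f}], "
          f"sample {mass_scale.sample(rng):.3f}")
```

The middle update (success rate 0.5) leaves the range unchanged, so the environment only gets harder once the policy has caught up with the current difficulty.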

All of the reactive regrasping you see is baked into the trained policy. It knows when it has a good grasp based on “feel” (proprioceptive difference between desired and measured joint angles).
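A rough sketch of the kind of "feel" signal this refers to: the gap between commanded and measured joint angles. The intuition (our reading, stated as an assumption) is that fingers blocked by an object can't reach their targets, so the gap grows on contact. The function names and the joint values below are hypothetical, and the policy consumes this signal as an input rather than applying any hand-set threshold.

```python
import math


def joint_tracking_error(desired, measured):
    """Per-joint difference between commanded and measured angles (radians)."""
    if len(desired) != len(measured):
        raise ValueError("desired and measured must have the same length")
    return [d - m for d, m in zip(desired, measured)]


def tracking_error_norm(desired, measured):
    """Euclidean norm of the per-joint tracking error."""
    return math.sqrt(sum(e * e for e in joint_tracking_error(desired, measured)))


# Hypothetical 4-joint finger: in free motion the joints track their targets
# closely; when squeezing an object they stop short of the commanded angles.
targets = [0.6, 0.8, 0.5, 0.3]
free_motion = tracking_error_norm(targets, [0.59, 0.79, 0.5, 0.3])
in_contact = tracking_error_norm(targets, [0.4, 0.55, 0.35, 0.25])
print(f"free motion: {free_motion:.3f} rad, in contact: {in_contact:.3f} rad")
```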

We ran a series of ablations on the perceptual architecture, looking at variations of the image encoder and at stereo vs monocular inputs. The main takeaway: it’s important to start the distillation process with a pretrained ResNet-18 encoder and finetune it. We tried all combinations of starting from pretrained vs random weights and finetuning the encoder vs freezing it, and it’s important to do both (start from pretrained weights and finetune). Surprisingly, monocular inputs worked much better than we expected. But it makes sense in retrospect. Close one eye and try picking things up from the table. It’s actually easier than it seems.
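For reference, the encoder part of that ablation is a 2x2 grid. A tiny sketch of how one might enumerate it (the config keys and labels are hypothetical, not our experiment code; the stereo vs monocular comparison is left out because the thread doesn't say whether it was crossed with this grid):

```python
from itertools import product

INIT = ("pretrained", "random")      # how the encoder weights start
TRAINING = ("finetuned", "frozen")   # whether the encoder is updated during distillation

# All four combinations of initialization and training mode.
encoder_ablations = [
    {"init": init, "training": training}
    for init, training in product(INIT, TRAINING)
]

for config in encoder_ablations:
    print(config)
```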

Links: Project website: DextrAH-RGB: Visuomotor Policies to Grasp Anything with Dexterous Hands

Progression of prior work building toward this system:
DeXtreme: Early sim2real work. Trained entirely in simulation and zero-shot deployed into the real world.
DeXtreme FGP: Adds geometric fabric controllers for safer deployment. Don’t damage the robot!
Background on geometric fabrics (RAL best paper 2022):
DextrAH-G: Unlock the arm, and do full-on grasping. Grasps anything placed in front of it. Geometric fabrics again allow us to be brazen with deployment on such a fragile physical system. This version operates only on depth.

And finally DextrAH-RGB (this work): No more depth. Just direct stereo RGB processing with a scaled SOTA perception architecture.

Many thanks to all our coauthors! @ritvik_singh9 @arthurallshire @ankurhandos and Karl Van Wyk. Amazing work! Ritvik and Karl, especially, drove the work and everyone put their heads together to figure out the ideal perceptual processing architecture.

More details in @ritvik_singh9's thread (first author)!

And @arthurallshire's!

