Shreyas Gite

@shreyasgite • 1,454 subscribers

AI products at https://t.co/8SkahBzqk0. Prev: founded self-driving at Kopernikus (acq. @Ford), jet engines @RollsRoyce.

Shorts

Gemini-powered robot can now effectively debug itself!

I've been obsessed with two main questions in robotics: can robots learn from their own mistakes without humans in the loop, and how much can we leverage synthetic data? Spoiler: yes, and it's surprisingly elegant once you have the right primitives in place.

The architecture is fairly simple (and optimized for GPU_Poor users):

Component I: Gemini Brain ♊️
- Gemini 2.0 Flash analyzes all training episodes through both camera perspectives
- Gemini 2.0 Pro creates a summary of training data, highlighting biases, limitations, etc.
- Train policy p0 on this initial data, run evaluation episodes
- Ask Gemini to categorize successes vs. failures (more insightful than you'd expect)
- Based on both analyses, Gemini generates specific augmentation recommendations

What's interesting here isn't that we're using LLMs for robotics - it's that we're closing the loop between perception, failure analysis, and targeted data generation.

Component II: Data Generation with Scene Consistency
The tricky part was maintaining consistency across both camera perspectives while generating new data. Three current augmentations:
- Frame flipping and polarity reversals
- Grounded-SAM + OpenCV for object color manipulation
- Gemini to identify empty space and generate distractions in the scene

…and repeat, ha!

I'm using the so100 robot arm and Sarah’s Vintage from Hugging Face. And the APIs and models in the Gemini family are ace! Thank you Logan Kilpatrick, Patrick Loeber, and team for this.

In thread The Circus of Making It Actually Work🧵:
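The train, evaluate, analyze, augment, repeat cycle described in this post can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the author's code: every name here (`self_debug_loop`, `train`, `evaluate`, `analyze`, `augment`, `max_rounds`, `target_success`) is hypothetical, and the Gemini calls, policy training, and data generation are passed in as plain callables rather than implemented.

```python
from typing import Callable, List, Sequence, Tuple


def self_debug_loop(
    dataset: List,
    train: Callable[[Sequence], object],
    evaluate: Callable[[object], List[bool]],
    analyze: Callable[[Sequence, List[bool]], List[str]],
    augment: Callable[[Sequence, List[str]], List],
    max_rounds: int = 3,
    target_success: float = 0.9,
) -> Tuple[object, List[float]]:
    """Hypothetical train -> evaluate -> analyze -> augment loop.

    train:    fits a policy on the current dataset (p0 in the first round)
    evaluate: runs evaluation episodes, returns one success flag per episode
    analyze:  stand-in for the Gemini analysis; returns augmentation
              recommendations from the data and the episode outcomes
    augment:  turns recommendations into new synthetic episodes
    """
    history: List[float] = []
    policy = None
    for _ in range(max_rounds):
        policy = train(dataset)
        outcomes = evaluate(policy)
        success_rate = sum(outcomes) / len(outcomes) if outcomes else 0.0
        history.append(success_rate)
        if success_rate >= target_success:
            break  # good enough: stop generating more data
        recommendations = analyze(dataset, outcomes)
        # Build a new list so the caller's dataset is not mutated.
        dataset = dataset + augment(dataset, recommendations)
    return policy, history
```

Passing the stages in as callables keeps the sketch independent of any particular model API; the stopping threshold and round limit are illustrative choices, not values from the post.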


47,245 views

Pi0 vs. ACT with BBox conditioning 🟦

Not many know you can push ACT to *almost-Pi0* generalisation by conditioning on bounding boxes (BBoxes).

How is the training data collected?
• Generate BBoxes for all pick-and-place objects in the scene. (I used Gemini)
• Pick-and-place targets are selected randomly.
• Add the BBox coordinates to the robot’s state.
• Overlay the BBoxes in the visualisation so you know what to grab and where to drop.

During inference:
• Generate BBoxes for every object again.
• Click the object you want to pick and its target spot; those BBoxes get added to the robot state.
• Let the robot do the work for you 😃

Setup:
- Trained ACT for 100k steps and fine-tuned Pi0 for only 20k.
- Training data is 60 episodes and had *only* LEGO bricks.
- Using single front camera (Laptop in this case)

Got the idea from xun in LeRobot discord. Here’s ACT vs Pi0 on a toy car that isn’t in the dataset. 1/3
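The idea of adding the clicked pick and place BBoxes to the robot state could look something like the sketch below. It is an illustration only: the helper names (`select_bbox`, `normalize_bbox`, `conditioned_state`), the `(x_min, y_min, x_max, y_max)` pixel layout, and the normalization to [0, 1] are all assumptions, not details given in the post.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed layout: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def select_bbox(bboxes: Sequence[BBox], click: Tuple[float, float]) -> Optional[BBox]:
    """Return the smallest box containing the click, or None if none does."""
    x, y = click
    hits = [b for b in bboxes if b[0] <= x <= b[2] and b[1] <= y <= b[3]]
    if not hits:
        return None
    return min(hits, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))


def normalize_bbox(bbox: BBox, width: int, height: int) -> List[float]:
    """Scale pixel coordinates to [0, 1] (an assumed preprocessing step)."""
    x_min, y_min, x_max, y_max = bbox
    return [x_min / width, y_min / height, x_max / width, y_max / height]


def conditioned_state(
    joint_state: Sequence[float],
    pick: BBox,
    place: BBox,
    width: int,
    height: int,
) -> List[float]:
    """Append the pick and place boxes to the joint state vector."""
    return (
        list(joint_state)
        + normalize_bbox(pick, width, height)
        + normalize_bbox(place, width, height)
    )
```

With this layout the policy input grows by eight values (four per box), so the same conditioning format can be used when collecting training data and at inference time.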


34,751 views

Videos


Learning from Human Demonstrations: Show the Robot How to Act!

The pipeline is very similar to older experiments using Gemini & pi0 with LeRobot. Pi-zero runs locally, while Gemini Flash generates the affordances and the high-level task. (More details are in the thread.)

The new component is learning from demonstrations via Gemini 2.5 Pro. I capture a video while demoing & take one of the last frames. Gemini 2.5 Pro then extracts the instructions & passes them to Gemini Flash to process the scene.

The fun part is that no fancy insight came from me, other than the days spent figuring out the right prompts. It's the bitter lesson hitting you in the face -> Enhanced Gemini capabilities make this possible. For example, Gemini Flash cannot do Russian doll stacking, but Gemini 2.5 Pro can do it consistently.

The current limitation is low-level manipulation:
- As you can see, I'm aligning the objects so they are easy to grasp using the same technique from the training data. I couldn't get Gemini Flash to consistently output an accurate grasping angle, and Gemini 1.5 Pro was too expensive and slow for real-time deployment.
- Getting a symmetrical gripper should also help a lot. Adding rubber to the tips would probably also help prevent objects from slipping.

Collecting & curating the data was the most time-consuming & labor-intensive part. Next, to improve low-level manipulation and make the system more real-time, I'm shifting to focus more on sims & synthetic data. This aligns better with my core competence. I'm open to tips and suggestions.
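The demonstration step described in this post (take a late frame from the demo video, extract instructions, hand them to a second model that processes the scene) might be wired together roughly as below. This is a sketch under assumptions: `pick_late_frame`, `demo_to_plan`, `extract_instructions`, `plan_scene`, `frames_from_end`, and `live_scene` are hypothetical names, the two model calls are injected as callables rather than real Gemini API calls, and exactly what each model receives as input is a guess.

```python
from typing import Callable, Sequence, TypeVar

Frame = TypeVar("Frame")
Plan = TypeVar("Plan")


def pick_late_frame(frames: Sequence[Frame], frames_from_end: int = 5) -> Frame:
    """Pick a frame near the end of the demo video, clamped to a valid index."""
    if not frames:
        raise ValueError("demo video has no frames")
    index = max(0, len(frames) - 1 - frames_from_end)
    return frames[index]


def demo_to_plan(
    frames: Sequence[Frame],
    extract_instructions: Callable[[Sequence[Frame], Frame], str],
    plan_scene: Callable[[str, object], Plan],
    live_scene: object,
    frames_from_end: int = 5,
) -> Plan:
    """Turn a human demo into a plan for the current scene.

    extract_instructions: stand-in for the stronger model reading the demo
    plan_scene:           stand-in for the faster model processing the scene
    """
    final_frame = pick_late_frame(frames, frames_from_end)
    instructions = extract_instructions(frames, final_frame)
    return plan_scene(instructions, live_scene)
```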

Shreyas Gite

22,480 views • 1 year ago
