Shreyas Gite

@shreyasgite • 1,454 subscribers

AI products at https://t.co/8SkahBzqk0. Prev: founded self-driving at Kopernikus (acq. @Ford), jet engines @RollsRoyce.

Shorts

Gemini-powered robot can now effectively debug itself!

I've been obsessed with two main questions in robotics: can robots learn from their own mistakes without humans in the loop, and how much can we leverage synthetic data? Spoiler: yes, and it's surprisingly elegant once you have the right primitives in place.

The architecture is fairly simple (and optimized for GPU_Poor users):

Component I: Gemini Brain ♊️
- Gemini 2.0 Flash analyzes all training episodes through both camera perspectives
- Gemini 2.0 Pro creates a summary of training data, highlighting biases, limitations, etc.
- Train policy p0 on this initial data, run evaluation episodes
- Ask Gemini to categorize successes vs. failures (more insightful than you'd expect)
- Based on both analyses, Gemini generates specific augmentation recommendations

What's interesting here isn't that we're using LLMs for robotics; it's that we're closing the loop between perception, failure analysis, and targeted data generation.

Component II: Data Generation with Scene Consistency
The tricky part was maintaining consistency across both camera perspectives while generating new data. Three current augmentations:
- Frame flipping and polarity reversals
- Grounded-SAM + OpenCV for object color manipulation
- Gemini to identify empty space and generate distractions in the scene

…and repeat, ha!

I'm using the so100 robot arm and Sarah’s Vintage from Hugging Face. And the APIs and models in the Gemini family are ace! Thank you Logan Kilpatrick, Patrick Loeber, and team for this.

In thread: The Circus of Making It Actually Work 🧵

47,245 views
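The "frame flipping and polarity reversals" augmentation can be illustrated with a minimal sketch. This is one reading of the phrase, not the author's code: it assumes that mirroring a scene left-right requires negating some action dimensions, and the choice of dimension 0 in `MIRRORED_ACTION_DIMS` is a hypothetical placeholder. Frames are plain nested lists to keep the example dependency-free. The key point is that every camera view is flipped the same way, so both perspectives stay consistent.

```python
from typing import Dict, List, Sequence, Tuple

Frame = List[List[int]]  # one grayscale image as rows of pixel values

# Assumption: action dimensions whose sign changes when the scene is
# mirrored left-right (for example a base pan joint). Hypothetical value.
MIRRORED_ACTION_DIMS = (0,)


def flip_frame(frame: Frame) -> Frame:
    """Mirror a frame horizontally by reversing each pixel row."""
    return [row[::-1] for row in frame]


def reverse_polarity(action: Sequence[float],
                     dims: Sequence[int] = MIRRORED_ACTION_DIMS) -> List[float]:
    """Negate the action dimensions that change sign under mirroring."""
    return [-a if i in dims else a for i, a in enumerate(action)]


def augment_episode(
    views: Dict[str, List[Frame]],
    actions: List[Sequence[float]],
) -> Tuple[Dict[str, List[Frame]], List[List[float]]]:
    """Flip every camera view identically and adjust the actions to match."""
    flipped_views = {
        name: [flip_frame(frame) for frame in frames]
        for name, frames in views.items()
    }
    flipped_actions = [reverse_polarity(a) for a in actions]
    return flipped_views, flipped_actions
```

Applying the same transform to all cameras and to the action labels in one function is what keeps the generated episode self-consistent; flipping only one view, or flipping images without touching the actions, would produce contradictory training data.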

Pi0 vs. ACT with BBox conditioning 🟦

Not many know you can push ACT to *almost-Pi0* generalisation by conditioning on bounding boxes (BBoxes).

How is the training data collected?
• Generate BBoxes for all pick-and-place objects in the scene. (I used Gemini.)
• Pick-and-place targets are selected randomly.
• Add the BBox coordinates to the robot’s state.
• Overlay the BBoxes in the visualisation so you know what to grab and where to drop.

During inference:
• Generate BBoxes for every object again.
• Click the object you want to pick and its target spot; those BBoxes get added to the robot state.
• Let the robot do the work for you 😃

Setup:
- Trained ACT for 100k steps and fine-tuned Pi0 for only 20k.
- Training data is 60 episodes and had *only* LEGO bricks.
- Using a single front camera (a laptop camera in this case).

Got the idea from xun in the LeRobot Discord. Here’s ACT vs. Pi0 on a toy car that isn’t in the dataset. 1/3

34,863 views
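The "add the BBox coordinates to the robot's state" step could look roughly like the sketch below. The function names, the `(x_min, y_min, x_max, y_max)` pixel format, and the normalisation to the 0..1 range are all assumptions made for illustration; the post does not say how the boxes are encoded.

```python
from typing import List, Sequence, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[float, float, float, float]


def normalize_bbox(bbox: BBox, width: int, height: int) -> List[float]:
    """Scale pixel coordinates to 0..1 so they are resolution-independent."""
    x_min, y_min, x_max, y_max = bbox
    return [x_min / width, y_min / height, x_max / width, y_max / height]


def conditioned_state(
    joint_state: Sequence[float],
    pick_bbox: BBox,
    place_bbox: BBox,
    width: int,
    height: int,
) -> List[float]:
    """Append the pick and place boxes to the robot's proprioceptive state."""
    return (
        list(joint_state)
        + normalize_bbox(pick_bbox, width, height)
        + normalize_bbox(place_bbox, width, height)
    )
```

The same function would be used at training time (with randomly selected targets) and at inference time (with the boxes the user clicked), so the policy always sees a state vector of the same length: the joint values plus eight box coordinates.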

Videos

Having seen so many laundry-folding robot demos, I started training with my bimanual setup, and I've got to say I have new respect for those companies and individuals. Even collecting 200 episodes is tricky, and I cannot easily (yet!) bootstrap learning with affordances like I was able to do for Lego sorting. I also realized that for bimanual setups, data collected by left-handed and right-handed people would look different. Dealing with deformable fabric is super painful compared to solid objects.

35,511 views • 10 months ago

Gemini + π0 = actually useful robots! (Similar to what Physical Intelligence did with "Hi Robot")

I can now verbally tell the robot that I'm building a red Lego wall or wooden tower, and it will infer the next steps by itself and pass me the necessary pieces, tools, or materials, ha! You can also just ask it to bring you things!

The pipeline works as follows:
- OpenAI Whisper (local) → speech to text
- Gemini → makes sense of user requests, converts them to robot tasks, bounding boxes, grasping points, etc. (System 2 thinking FTW!)
- π0 → robotic actions

π0 was fine-tuned only for pick-and-place with Lego bricks, and it generalizes beautifully to all kinds of tasks. However, there's lots of room for improvement when it comes to grasping and accuracy. Things that could help:
- Conditioning on grasping points
- Better data collection (I'm not that great at teleop)
- Lots more synthetic data and simulations

25,485 views • 1 year ago
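The three-stage pipeline above (speech to text, then task planning, then robot actions) can be sketched as plain function composition. Everything named here is hypothetical: `RobotTask`, `run_pipeline`, and the three injected callables stand in for Whisper, Gemini, and π0 respectively, and none of them reflects those tools' real APIs.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RobotTask:
    """One low-level task produced by the planner (hypothetical structure)."""
    instruction: str
    bbox: Tuple[float, float, float, float]  # target object's bounding box


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],       # stand-in for local Whisper
    plan: Callable[[str], List[RobotTask]],   # stand-in for Gemini
    act: Callable[[RobotTask], bool],         # stand-in for the π0 policy
) -> List[bool]:
    """Turn a spoken request into robot tasks and execute them in order."""
    text = transcribe(audio)
    tasks = plan(text)
    return [act(task) for task in tasks]
```

Passing the stages in as callables keeps the slow, deliberate planner separate from the fast action policy, and it lets each stage be replaced with a stub when testing the glue code without a robot or API key.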

Learning from Human Demonstrations: Show the Robot How to Act!

The pipeline is very similar to older experiments using Gemini & pi0 with LeRobot. Pi-zero runs locally, while Gemini Flash generates the affordances and the high-level task. (More details are in the thread.)

The new component is learning from demonstrations via Gemini 2.5 Pro. I capture a video while demoing and take one of the last frames. Gemini 2.5 Pro then extracts the instructions and passes them to Gemini Flash to process the scene.

The fun part is that no fancy insight came from me, other than the days spent figuring out the right prompts. It's the bitter lesson hitting you in the face: enhanced Gemini capabilities make this possible. For example, Gemini Flash cannot do Russian doll stacking, but Gemini 2.5 Pro can do it consistently.

The current limitation is low-level manipulation:
- As you can see, I'm aligning the objects so they are easy to grasp using the same technique from the training data. I couldn't get Gemini Flash to consistently output an accurate grasping angle, and Gemini 1.5 Pro was too expensive and slow for real-time deployment.
- Getting a symmetrical gripper should also help a lot. Adding rubber to the tips would probably also help prevent objects from slipping.

Collecting and curating the data was the most time-consuming and labor-intensive part.

Next, to improve low-level manipulation and make the system more real-time, I'm shifting my focus to sims and synthetic data. This aligns better with my core competence. I'm open to tips and suggestions.

22,555 views • 1 year ago
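The demonstration step (take one of the last frames, extract instructions, then hand them to the scene model) could be wired together as in this sketch. The names `pick_late_frame` and `learn_from_demo`, the default offset of 3 frames from the end, and the two injected callables are assumptions for illustration; the post does not specify which late frame is used or how the models are called.

```python
from typing import Callable, Sequence, TypeVar

FrameT = TypeVar("FrameT")
PlanT = TypeVar("PlanT")


def pick_late_frame(frames: Sequence[FrameT], offset_from_end: int = 3) -> FrameT:
    """Return a frame near the end of the demo video.

    The offset is a hypothetical default; it skips the very last frames,
    where a hand may still be covering the final arrangement.
    """
    if not frames:
        raise ValueError("demo video has no frames")
    index = max(0, len(frames) - 1 - offset_from_end)
    return frames[index]


def learn_from_demo(
    demo_frames: Sequence[FrameT],
    live_frame: FrameT,
    extract_instructions: Callable[[FrameT], str],  # stand-in for the stronger model
    process_scene: Callable[[str, FrameT], PlanT],  # stand-in for the faster model
) -> PlanT:
    """Derive instructions from the demo's end state and apply them to the live scene."""
    goal_frame = pick_late_frame(demo_frames)
    instructions = extract_instructions(goal_frame)
    return process_scene(instructions, live_frame)
```

The split mirrors the division of labour described above: the expensive model runs once per demonstration, while the cheaper one handles the live scene.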
