
Shreyas Gite
@shreyasgite • 1,454 subscribers
AI products at https://t.co/8SkahBzqk0. Prev: founded self-driving at Kopernikus (acq. @Ford), jet engines @RollsRoyce.

Having seen so many laundry-folding robot demos, I started training with my bimanual setup, and gotta say I have new respect for those companies and individuals. Even collecting 200 episodes is tricky, and I cannot easily (yet!) bootstrap learning with affordances like I was able to do for Lego sorting. I also realized that for bimanual setups, data collected by left-handed and right-handed people would look different. Dealing with deformable fabric is super painful compared to solid objects.
Shreyas Gite • 35,511 views • 9 months ago

Gemini + π0 = actually useful robots! (Similar to what Physical Intelligence did with "Hi Robot")

I can now verbally tell the robot that I'm building a red Lego wall or a wooden tower, and it will infer the next steps by itself and pass me the necessary pieces, tools, or materials, ha! You can also just ask it to bring you things!

The pipeline works as follows:
- OpenAI Whisper (local) → speech to text
- Gemini → makes sense of user requests, converts them to robot tasks, bounding boxes, grasping points, etc. (System 2 thinking FTW!)
- π0 → robotic actions

The π0 model was finetuned only for pick-and-place with Lego bricks, and it generalizes beautifully to all kinds of tasks. However, there's lots of room for improvement when it comes to grasping & accuracy.

Things that could help:
- Conditioning on grasping points
- Better data collection (I'm not that great at teleop)
- Lots more synthetic data and simulations
Shreyas Gite • 25,404 views • 1 year ago
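
The three-stage pipeline in the post above (speech to text, then planning, then robot actions) could be wired together roughly as below. This is only a sketch under stated assumptions, not the author's code: `RobotTask`, `box_center_px`, and `run_pipeline` are hypothetical names, the three stages are passed in as plain callables standing in for Whisper, Gemini, and π0 (no real API is called), and the box layout `(ymin, xmin, ymax, xmax)` scaled to 0-1000 is an assumption about what the planner returns.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class RobotTask:
    """One step proposed by the planner (hypothetical structure)."""
    instruction: str                     # e.g. "pick up the red brick"
    box_norm: Tuple[int, int, int, int]  # assumed (ymin, xmin, ymax, xmax), scaled 0-1000


def box_center_px(box_norm: Tuple[int, int, int, int],
                  width: int, height: int) -> Tuple[int, int]:
    """Map a normalized box to its pixel-space center (x, y).

    The center is used here as a crude grasp point; a real system
    would likely need something better than the box center.
    """
    ymin, xmin, ymax, xmax = box_norm
    cx = (xmin + xmax) * width / 2000   # average, then rescale from 0-1000
    cy = (ymin + ymax) * height / 2000
    return round(cx), round(cy)


def run_pipeline(audio: Any,
                 image_size: Tuple[int, int],
                 transcribe: Callable[[Any], str],
                 plan: Callable[[str], List[RobotTask]],
                 act: Callable[[str, Tuple[int, int]], Any]) -> List[Any]:
    """Chain the stages: audio -> text -> tasks -> one action call per task."""
    text = transcribe(audio)            # stand-in for local speech-to-text
    tasks = plan(text)                  # stand-in for the planning model
    width, height = image_size
    results = []
    for task in tasks:
        grasp = box_center_px(task.box_norm, width, height)
        results.append(act(task.instruction, grasp))  # stand-in for the policy
    return results
```

Keeping the stages as injected callables means each one can be swapped or stubbed out independently, which is handy when the planner or the policy is the part being debugged.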

Learning from Human Demonstrations: Show the Robot How to Act!

The pipeline is very similar to older experiments using Gemini & pi0 with LeRobot. Pi-zero runs locally, while Gemini Flash generates the affordances and the high-level task. (More details are in the thread.)

The new component is learning from demonstrations via Gemini 2.5 Pro. I capture a video while demoing & take one of the last frames. Gemini 2.5 Pro then extracts the instructions & passes them to Gemini Flash to process the scene.

The fun part is that there's no fancy insight that came from me, other than the days spent figuring out the right prompts. It's the bitter lesson hitting you in the face -> enhanced Gemini capabilities make this possible. For example, Gemini Flash cannot do Russian doll stacking, but Gemini 2.5 Pro can do it consistently.

The current limitation is low-level manipulation:
- As you can see, I'm aligning the objects so they are easy to grasp using the same technique from the training data. I couldn't get Gemini Flash to consistently output an accurate grasping angle, and Gemini 1.5 Pro was too expensive and slow for real-time deployment.
- Getting a symmetrical gripper should also help a lot. Adding rubber to the tips would probably also help prevent objects from slipping.

Collecting & curating the data was the most time-consuming & labor-intensive part. Next, to improve low-level manipulation and make the system more real-time, I'm shifting my focus to sims & synthetic data. This aligns better with my core competence. I'm open to tips and suggestions.
Shreyas Gite • 22,480 views • 1 year ago
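
The demonstration step in the post above (take one of the last frames of the demo video, have a stronger model extract the instruction, hand it to a faster model to process the scene) might look something like this. It is a hedged sketch, not the author's implementation: `pick_late_frame` and `learn_from_demo` are hypothetical names, both models are plain callables with no real API behind them, and two details are assumptions the post does not confirm: that the stronger model sees only the selected frame, and that the faster model plans on a separate live image of the scene.

```python
from typing import Any, Callable, Sequence


def pick_late_frame(num_frames: int, offset_from_end: int = 5) -> int:
    """Return the index of a frame near the end of the demo.

    The default offset of 5 is an arbitrary illustrative choice; the idea
    is to skip the very last frames, where a hand may still be in view.
    """
    if num_frames <= 0:
        raise ValueError("demo video has no frames")
    return max(0, num_frames - 1 - offset_from_end)


def learn_from_demo(frames: Sequence[Any],
                    live_image: Any,
                    extract_instruction: Callable[[Any], str],
                    plan_scene: Callable[[str, Any], Any],
                    offset_from_end: int = 5) -> Any:
    """Turn a recorded demo into a plan for the current scene."""
    idx = pick_late_frame(len(frames), offset_from_end)
    goal_frame = frames[idx]
    instruction = extract_instruction(goal_frame)  # stand-in for the stronger model
    return plan_scene(instruction, live_image)     # stand-in for the faster model
```

Splitting the work this way keeps the slower, more capable model out of the real-time loop: it runs once per demonstration, while the faster model runs on every new scene.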