
Milind S
@milindlabs • 1,099 subscribers
Building TipTour dot io. Making website tour guides fun.
Videos

I’m reverse-engineering the Google AI mouse pointer and making it open source! It sees your screen. You can freeform paint anything and ask AI to modify it. It understands the app or window you mean. It can click, type, edit, navigate, and even write and execute code. A tiny cursor becoming a real computer-use agent. Made with the help of Cua, which is an AMAZING reference for computer use. Sorry, Google DeepMind, but I had to do this. heyclicky started a revolution, and we will be the ones to win. Also, credits to Omkar Satpute for the help. Feel free to modify it, update it, and make it your own; we gotta beat them at this. Link to the repo in the comments.
Milind S • 210,879 views • 26 days ago

People are overlooking Google Gemini Realtime models for computer use. It gave me sub-100ms latency with computer use. It has a much larger context window and is much cheaper as well. Combine that with local OCR and a local screen detection model based on OmniParser by Microsoft, and it takes actions in under 100ms when combined with Cua. I also put in a harness for Nous Research Hermes with it. You can access it all at the tip of your cursor. You can draw on your screen to give your agents context. And I am making it open source! Link in the comments. Sundar Pichai, Min-Liang Tan, Mojtaba Seyedhosseini
Milind S • 37,538 views • 7 days ago

I made the world's best clipboard for your AI agents. It's called Bluey, and it's a dumb pointer that you can use to give 10x better context to your agent. You can point at your screen, talk to it, and it will generate a really good annotated screenshot that you can give to your AI agent. You can use it while designing your website. You can use it while explaining something on your screen. It understands you, and you can be really vague about things, like "move this" or modify the color "here", and it will understand what you are pointing at. I'm loving using it. I built it for myself because I was tired of taking screenshots all the time and attaching them to my Codex or Claude Code sessions. Now I can just point, speak or type, and drag and drop it anywhere. I am giving free access to it in the comments, and I would love for you to try it out. Join the Discord; I am sharing the download link there. By the way, everything stays 100 percent local. No screenshots are shared anywhere!
Milind S • 11,602 views • 5 days ago
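
One plausible way to resolve a vague "this" or "here" into a concrete target is a hit test: take the pointer position and pick the smallest on-screen element that contains it. The sketch below is only an illustration of that idea, not Bluey's actual implementation; the `Element` type, its pixel-coordinate fields, and `element_under_pointer` are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Element:
    """A hypothetical on-screen element with a pixel bounding box."""
    name: str
    left: int
    top: int
    right: int   # exclusive
    bottom: int  # exclusive

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x < self.right and self.top <= y < self.bottom

    @property
    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)


def element_under_pointer(
    elements: Sequence[Element], x: int, y: int
) -> Optional[Element]:
    """Return the smallest element containing (x, y), or None if nothing does.

    Preferring the smallest box means a logo inside a header inside a page
    resolves to the logo, which is usually what "this" refers to.
    """
    hits = [e for e in elements if e.contains(x, y)]
    return min(hits, key=lambda e: e.area, default=None)


elements = [
    Element("page", 0, 0, 1000, 800),
    Element("header", 0, 0, 1000, 100),
    Element("logo", 20, 20, 120, 80),
]
target = element_under_pointer(elements, 50, 50)
print(target.name if target else "nothing under the pointer")
```

Choosing the smallest containing box is a simple heuristic; a real tool would likely also weigh the spoken or typed instruction when several elements overlap.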

I took Clicky and made it 5x faster. I saw that Farza 🇵🇰🇺🇸 was using Claude's vision to find UI elements on screen: send a screenshot, wait for coordinates back. It works, but it's slow. I replaced that with OmniParser V2 by Microsoft, a local object detection model trained specifically on UI elements. It runs on-device, detects every button, menu, and icon in 400ms, and gives me pixel-precise coordinates. No API call, no network latency, no cost. The green highlights you see around the UI elements are the detection overlay, and as I switch screens you can see it takes no time to detect and highlight, which is pretty neat! With local models improving by the day, it's the right direction for applications like these. Next up: video-synced tutorials where a YouTube tutorial pauses and waits for you to perform each action in the real app. I am not stopping!
Milind S • 72,849 views • 1 month ago
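
To make the "detections to coordinates" step concrete, here is a minimal sketch of turning a detector's bounding boxes into a pixel click point. It assumes the detector reports each box as normalized `(x_min, y_min, x_max, y_max)` values between 0 and 1; the `Detection` type and both helper functions are hypothetical names for illustration, not OmniParser's API or the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Detection:
    """A hypothetical detector result for one UI element."""
    label: str
    # Assumed format: normalized (x_min, y_min, x_max, y_max), each in [0, 1].
    bbox: Tuple[float, float, float, float]


def click_point(det: Detection, screen_w: int, screen_h: int) -> Tuple[int, int]:
    """Return the pixel at the center of the detection's bounding box."""
    x_min, y_min, x_max, y_max = det.bbox
    cx = (x_min + x_max) / 2 * screen_w
    cy = (y_min + y_max) / 2 * screen_h
    return round(cx), round(cy)


def find_by_label(detections: Sequence[Detection], query: str) -> Optional[Detection]:
    """Return the first detection whose label contains the query, ignoring case."""
    q = query.casefold()
    for det in detections:
        if q in det.label.casefold():
            return det
    return None


detections = [
    Detection("File menu", (0.0, 0.0, 0.125, 0.0625)),
    Detection("Save button", (0.5, 0.25, 0.75, 0.5)),
]
save = find_by_label(detections, "save")
if save is not None:
    print(click_point(save, 1920, 1080))
```

Because everything here is plain arithmetic on boxes the local model already produced, no screenshot has to leave the machine to get a click target.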