Milind S's banner

Milind S

@milindlabs • 1,174 subscribers

Check out SupaMaus!!

Videos

Anya Rossi

sweetdream.ai

SweetDream.ai•Sponsored•Livecam

Watch Anya Live

Anya is streaming live right now! Join her private show and enjoy exclusive content.

Exclusive private shows

1.2k viewers online

Private Show

Join now for exclusive access

Free preview available • Premium content

I’m reverse-engineering the Google AI mouse pointer and making it Open Source! It sees your screen. You can freeform paint anything and ask AI to modify it. It understands the app or window you mean. It can click, type, edit, navigate, and even write + execute code. A tiny cursor becoming a real computer-use agent. Made with the help of Cua which is an AMAZING reference for computer use Sorry Google DeepMind but I had to do this heyclicky Started a revolution and we will be the ones to win Also credits to Omkar Satpute for the help Feel free to modify it , update it and make it your own, we gotta beat them at this. Link to the repo in the comments:

I’m reverse-engineering the Google AI mouse pointer and making it Open Source! It sees your screen. You can freeform paint anything and ask AI to modify it. It understands the app or window you mean. It can click, type, edit, navigate, and even write + execute code. A tiny cursor becoming a real computer-use agent. Made with the help of Cua which is an AMAZING reference for computer use Sorry Google DeepMind but I had to do this heyclicky Started a revolution and we will be the ones to win Also credits to Omkar Satpute for the help Feel free to modify it , update it and make it your own, we gotta beat them at this. Link to the repo in the comments:

213,009 次观看 • 2 个月前

I took Clicky and I made it 5x Faster So I saw that Farza 🇵🇰🇺🇸 was using uses Claude's vision to find UI elements on screen, send a screenshot, wait for coordinates back. It works, but it's slow. I replaced that with OmniParser V2 by Microsoft which is a local Object detection model trained specifically on UI elements. It runs on-device, detects every button, menu, and icon in 400ms, and gives me pixel-precise coordinates. No API call, no latency, no cost. The green highlights you see around the UI elements is the detection overlay and you can see as I am switching the screen it takes no time to detect and highlight which is pretty neat! With Local models improving by the day, Its the right direction for applications like these. Next up: video-synced tutorials where a YouTube tutorial pauses and waits for you to perform each action in the real app. I am not stopping!

I took Clicky and I made it 5x Faster So I saw that Farza 🇵🇰🇺🇸 was using uses Claude's vision to find UI elements on screen, send a screenshot, wait for coordinates back. It works, but it's slow. I replaced that with OmniParser V2 by Microsoft which is a local Object detection model trained specifically on UI elements. It runs on-device, detects every button, menu, and icon in 400ms, and gives me pixel-precise coordinates. No API call, no latency, no cost. The green highlights you see around the UI elements is the detection overlay and you can see as I am switching the screen it takes no time to detect and highlight which is pretty neat! With Local models improving by the day, Its the right direction for applications like these. Next up: video-synced tutorials where a YouTube tutorial pauses and waits for you to perform each action in the real app. I am not stopping!

73,191 次观看 • 3 个月前

I made a stick shift claude code app Its very fun, suppose you are working on a complex project with fable and suddenly you wanna change the model to sonnet to do a simple "can you git push this" Or "can you quickly change the color of this button" Its for everyone who loves driving a manual automatic mode coming soon. for the rest of you.

I made a stick shift claude code app Its very fun, suppose you are working on a complex project with fable and suddenly you wanna change the model to sonnet to do a simple "can you git push this" Or "can you quickly change the color of this button" Its for everyone who loves driving a manual automatic mode coming soon. for the rest of you.

18,831 次观看 • 21 天前

People are overlooking Google Gemini Realtime models for computer use It gave me sub 100ms latency with computer use. It has a much larger context window and is much cheaper as well Combine that with local OCR and local screen detection model based on Omniparser by Microsoft it works under 100ms action taking when combined with Cua I also put in a harness for Nous Research Hermes with it. You can access it all at your tip of your cursor. You can draw on your screen to give your agents a context And I am making it Open Source! Link in the Comments Sundar Pichai Min-Liang Tan Mojtaba Seyedhosseini

People are overlooking Google Gemini Realtime models for computer use It gave me sub 100ms latency with computer use. It has a much larger context window and is much cheaper as well Combine that with local OCR and local screen detection model based on Omniparser by Microsoft it works under 100ms action taking when combined with Cua I also put in a harness for Nous Research Hermes with it. You can access it all at your tip of your cursor. You can draw on your screen to give your agents a context And I am making it Open Source! Link in the Comments Sundar Pichai Min-Liang Tan Mojtaba Seyedhosseini

38,054 次观看 • 1 个月前

A screen recording is the best context you can give an AI agent. It's also the worst. A few seconds of video is thousands of images. Feed that to an agent and you instantly blow up its memory. So I built a tool that fixes it. How it works: Hit record and just do your thing. Click around, type, and talk out loud about what you're doing. While you record, it quietly captures the stuff that actually matters: your voice transcription, every click and the exact thing you clicked, what you typed, and the shortcuts you press. When you stop, it doesn't hand the agent the whole video. It writes a short, readable summary: a timeline of what you said and did. The agent reads that tiny summary first. Then, only when it needs to actually see a moment, it pulls that one exact frame from the video, like flipping to a single page instead of reading the whole book. The result: the agent understands everything you did! Full context. Less tokens.

A screen recording is the best context you can give an AI agent. It's also the worst. A few seconds of video is thousands of images. Feed that to an agent and you instantly blow up its memory. So I built a tool that fixes it. How it works: Hit record and just do your thing. Click around, type, and talk out loud about what you're doing. While you record, it quietly captures the stuff that actually matters: your voice transcription, every click and the exact thing you clicked, what you typed, and the shortcuts you press. When you stop, it doesn't hand the agent the whole video. It writes a short, readable summary: a timeline of what you said and did. The agent reads that tiny summary first. Then, only when it needs to actually see a moment, it pulls that one exact frame from the video, like flipping to a single page instead of reading the whole book. The result: the agent understands everything you did! Full context. Less tokens.

12,334 次观看 • 23 天前

i accidentally built the best research buddy And it works with my mouse draw over a section in a local PDF, saved highlight a link on any site, saved grab a comment in some forum, saved it all lands in my notch meet supamaus Its a clipboard for you and your AI

i accidentally built the best research buddy And it works with my mouse draw over a section in a local PDF, saved highlight a link on any site, saved grab a comment in some forum, saved it all lands in my notch meet supamaus Its a clipboard for you and your AI

15,510 次观看 • 1 个月前

I built the clipboard I always wanted for working with AI agents. It’s called Bluey and It’s 100% local-first. It lives under your mouse cursor. You can draw on your screen, speak to it, or type what you mean, and Bluey turns all of that into rich context for Claude Code or Codex. I’ve been using it to design websites and debug UI because I can finally say things like “move this here” or “make this section feel cleaner” while pointing at the actual screen. No more writing long explanations. No more manually describing screenshots. No more agents getting lost because they don’t know what “this” or "that" means. You can just point to it, speak or type and it will curate the best possible context for your agents Bluey captures the screenshot, annotation, transcript, app/window context, and the details your agent needs, then sends it directly into your coding session. I am giving it out for free for a while! Im building it with my buddy Omkar Satpute and we would love to hear what you think about it on Discord what context do you wish your agent could understand better?

I built the clipboard I always wanted for working with AI agents. It’s called Bluey and It’s 100% local-first. It lives under your mouse cursor. You can draw on your screen, speak to it, or type what you mean, and Bluey turns all of that into rich context for Claude Code or Codex. I’ve been using it to design websites and debug UI because I can finally say things like “move this here” or “make this section feel cleaner” while pointing at the actual screen. No more writing long explanations. No more manually describing screenshots. No more agents getting lost because they don’t know what “this” or "that" means. You can just point to it, speak or type and it will curate the best possible context for your agents Bluey captures the screenshot, annotation, transcript, app/window context, and the details your agent needs, then sends it directly into your coding session. I am giving it out for free for a while! Im building it with my buddy Omkar Satpute and we would love to hear what you think about it on Discord what context do you wish your agent could understand better?

13,342 次观看 • 1 个月前

I made the worlds best Clipboard for your AI Agents Its called Bluey and its a dumb pointer which you can use to give 10x better context to your agent And you can point at your screen, talk to it, and it will generate a really good annotated screenshot which you can now give to your AI Agent You can use it while designing your website You can use it while explaining something on your screen. And it understands you and you can be really vague about things like "move this" , modify the color "here" and it will understand what you are pointing at. Im loving using it and I built it for myself as I was tired of clicking screenshots all the time and attaching it to my codex or claude code sessions Now I can just point, speak or type and drag and drop it anywhere I am giving free access to it in the comments and I would love for you to try it out. Join the discord and I am sharing the download link there. By the way everything stays 100 percent local. No screenshots are shared anywhere!

I made the worlds best Clipboard for your AI Agents Its called Bluey and its a dumb pointer which you can use to give 10x better context to your agent And you can point at your screen, talk to it, and it will generate a really good annotated screenshot which you can now give to your AI Agent You can use it while designing your website You can use it while explaining something on your screen. And it understands you and you can be really vague about things like "move this" , modify the color "here" and it will understand what you are pointing at. Im loving using it and I built it for myself as I was tired of clicking screenshots all the time and attaching it to my codex or claude code sessions Now I can just point, speak or type and drag and drop it anywhere I am giving free access to it in the comments and I would love for you to try it out. Join the discord and I am sharing the download link there. By the way everything stays 100 percent local. No screenshots are shared anywhere!

11,917 次观看 • 1 个月前

没有更多内容可加载