Microsoft released a groundbreaking model that can be used for web automation, under the MIT license 🔥👏 OmniParser is a state-of-the-art UI parsing/understanding model that outperforms GPT-4V in parsing. 👏
473,069 views • 1 year ago • via X (Twitter)
9 Comments

Model: An interesting highlight for me was the model's capabilities on Mind2Web (a benchmark for web navigation), which unlock agentic behavior for RPA agents. No need for hefty web automation pipelines that break when the website/app design changes! Amazing work.

Lastly, the authors also fine-tune this model on open-set detection of interactable regions to see if they can use it as a plug-in for VLMs, and it actually outperforms off-the-shelf open-set detectors like GroundingDINO. 👏

Here's a bunch of i/o examples for the model ⇓

I saw your post and it made me think of which I had just come across. Would be interested in hearing from @skyvernai about the possibility of using OmniParser to replace their current approach.

Incorrect, it has an AGPL-3.0 license, since it is based on YOLOv8 by Ultralytics, which is AGPL-3.0 licensed. You can use it commercially, but your code must be publicly available, which makes it commercially unviable again.

It's weird to compare to GPT-4V, which is notoriously bad at image understanding and OCR, right? I'd be curious to know how it fares against 4o, Sonnet, Gemini Flash and Pro, etc.

Might be nice to combine with Anthropic's computer use.
