Introducing UI-TARS-1.5, a vision-language model that beats OpenAI Operator and Claude 3.7 on GUI Agent and Game Agent tasks. We've open-sourced a smaller version of the model for research purposes; more details can be found in our blog. TARS learns solely from a screen, but generalizes beyond a screen! Blog: Model: App:

85,137 views • 1 year ago • via X (Twitter)

22 Comments

Yujia Qin@ICLR2025 • 1 year ago

UI-TARS-1.5 achieves SOTA results on several GUI benchmarks, e.g., OSWorld, Windows Agent Arena, Online-Mind2Web, AndroidWorld, and ScreenSpot-Pro. These results demonstrate UI-TARS's superiority on computer use, browser use, and phone use. Also, with the GUI Tool, UI-TARS almost matches GPT-4o with the search API.

Yujia Qin@ICLR2025 • 1 year ago

Here's a demo from UI-TARS on GUI tasks~

Yujia Qin@ICLR2025 • 1 year ago

To further assess UI-TARS-1.5 in complex, open-ended environments, we tested it on Minecraft—a popular sandbox game well-suited for evaluating embodied intelligence. Unlike static GUI benchmarks, Minecraft requires real-time decision-making in a dynamic 3D space using visual input and low-level controls (mouse and keyboard), closely reflecting real-world computer use.

Yujia Qin@ICLR2025 • 1 year ago

TARS has amazing inference-time scaling ability. With more interaction rounds, TARS achieves far better performance in GUI tasks and Game tasks. The scaling curve surpasses both OpenAI CUA and Claude 3.7. We even observe performance gain when the interaction rounds are over 1000 steps.

Yujia Qin@ICLR2025 • 1 year ago

Gameplay represents a critical frontier for multimodal agents, serving as an ideal testing ground for evaluating complex reasoning, decision-making, and adaptability. Games demand intuitive, common-sense reasoning and strategic foresight, making them perfect benchmarks to test and showcase the advanced cognitive capabilities of multimodal agents. To evaluate UI-TARS-1.5's gameplay proficiency, we selected 14 diverse games from Each model was allowed up to 1,000 interaction steps per game to generate execution traces, repeated across multiple runs.
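The protocol described above (a fixed step budget per game, repeated over several runs) can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation harness: `run_episode`, `evaluate`, and the toy `make_step_fn` are hypothetical names, and the "agent" here is a stand-in that finishes after a fixed number of steps.

```python
def run_episode(step_fn, max_steps=1000):
    """Run one episode; stop when step_fn reports success or the budget is spent."""
    for step in range(1, max_steps + 1):
        if step_fn(step):          # hypothetical: returns True once the task is solved
            return True, step
    return False, max_steps


def evaluate(make_step_fn, runs=3, max_steps=1000):
    """Repeat the episode over several runs and report the success rate."""
    results = [run_episode(make_step_fn(i), max_steps) for i in range(runs)]
    success_rate = sum(ok for ok, _ in results) / runs
    return success_rate, results


# Toy stand-in for an agent: run i "solves" the game at step 10 * (i + 1).
def make_step_fn(run_index):
    return lambda step: step >= 10 * (run_index + 1)


rate, results = evaluate(make_step_fn, runs=3, max_steps=25)
print(rate, results)
```

With a budget of 25 steps, the third toy run exhausts its budget without finishing, which is the kind of case where a larger step budget would change the outcome.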

Yujia Qin@ICLR2025 • 1 year ago

Explore more interesting showcases of UI-TARS on

Chris Barber • 1 year ago

42% on OSWorld is impressive!

Yujia Qin@ICLR2025 • 1 year ago

Thanks! Will be higher soon!

orange.ai • 1 year ago

Impressive!

Cua • 1 year ago

soon as an agent loop in c/ua 👀

Yujia Qin@ICLR2025 • 1 year ago

Sure it will be!

yanghan • 1 year ago

nice work

Petr Glaser • 1 year ago

How well can it play Pokemon? 🤔

Oli • 1 year ago

looks really cool, but when can we access the larger 1.5, and will it be open source too?

Yujia Qin@ICLR2025 • 1 year ago

Sure! It will be soon.

Oli • 1 year ago

nice, really excited to try it. great work

Ajay Sreeram • 1 year ago

I was trying 1.5 7B; it always tries to click a few pixels off, diagonally above the target. Do we need to pass the screen size somewhere from the desktop app?
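One possible cause of a consistent offset like this (an assumption, not something confirmed in this thread) is that the screenshot is resized before the model sees it, so the predicted coordinates live in the resized image's space rather than in screen pixels. A minimal sketch of the rescaling step, with a hypothetical helper `to_screen_coords` and made-up sizes:

```python
def to_screen_coords(x, y, model_size, screen_size):
    """Map a point from the model's input-image space to real screen pixels.

    model_size  -- (width, height) of the image the model actually received
    screen_size -- (width, height) of the real screen
    Both are assumptions about the setup; check what your pipeline resizes to.
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    return round(x * screen_w / model_w), round(y * screen_h / model_h)


# Example: the model saw a 1280x720 image of a 2560x1440 screen.
print(to_screen_coords(640, 360, (1280, 720), (2560, 1440)))
```

If the two sizes are identical the function returns the point unchanged, so applying it is harmless when no resizing took place.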

chadhietala • 1 year ago

Can you give details about deployment on vLLM? It seems like the model requires a min-version of it.

☼░▒▅ • 1 year ago

plans to open source the full model?

Yujia Qin@ICLR2025 • 1 year ago

Soon there will be~

☼░▒▅ • 1 year ago

🥹

Rainmaker • 2 years ago

Here I share an XGBoost model that delivers a 25% CAGR with minimal drawdown on Visa stock. In this free Substack post I share code and commentary for a powerful Machine Learning strategy that delivers powerful returns.
