Introducing UI-TARS-1.5, a vision-language model that beats OpenAI Operator and Claude 3.7 on GUI Agent and Game Agent tasks. We've open-sourced a small version of the model for research purposes; more details can be found in our blog. TARS learns solely from a screen, but generalizes beyond a screen! Blog: Model: App:

85,137 views • 1 year ago • via X (Twitter)

22 comments

Yujia Qin@ICLR2025 • 1 year ago

UI-TARS-1.5 achieves SOTA results on several GUI benchmarks, e.g., OSWorld, Windows Agent Arena, Online-Mind2Web, AndroidWorld, and ScreenSpot-Pro. These results demonstrate UI-TARS's superiority on computer use, browser use, and phone use. Also, with the GUI tool, UI-TARS almost matches GPT-4o with the search API.

Yujia Qin@ICLR2025 • 1 year ago

Here's a demo from UI-TARS on GUI tasks~

Yujia Qin@ICLR2025 • 1 year ago

To further assess UI-TARS-1.5 in complex, open-ended environments, we tested it on Minecraft—a popular sandbox game well-suited for evaluating embodied intelligence. Unlike static GUI benchmarks, Minecraft requires real-time decision-making in a dynamic 3D space using visual input and low-level controls (mouse and keyboard), closely reflecting real-world computer use.

Yujia Qin@ICLR2025 • 1 year ago

TARS has amazing inference-time scaling ability. With more interaction rounds, TARS achieves far better performance on GUI tasks and game tasks. The scaling curve surpasses both OpenAI CUA and Claude 3.7. We even observe performance gains when the interaction rounds exceed 1,000 steps.

Yujia Qin@ICLR2025 • 1 year ago

Gameplay represents a critical frontier for multimodal agents, serving as an ideal testing ground for evaluating complex reasoning, decision-making, and adaptability. Games demand intuitive, common-sense reasoning and strategic foresight, making them perfect benchmarks to test and showcase the advanced cognitive capabilities of multimodal agents. To evaluate UI-TARS-1.5's gameplay proficiency, we selected 14 diverse games from Each model was allowed up to 1,000 interaction steps per game to generate execution traces, repeated across multiple runs.

Yujia Qin@ICLR2025 • 1 year ago

Explore more interesting showcases of UI-TARS on

Chris Barber • 1 year ago

42% on OSWorld is impressive!

Yujia Qin@ICLR2025 • 1 year ago

Thanks! It will be higher soon!

orange.ai • 1 year ago

Impressive!

Cua • 1 year ago

soon as an agent loop in c/ua 👀

Yujia Qin@ICLR2025 • 1 year ago

Sure it will be!

yanghan • 1 year ago

nice work

Petr Glaser • 1 year ago

How well can it play Pokemon? 🤔

Oli • 1 year ago

Looks really cool, but when can we access the larger 1.5, and will it be open source too?

Yujia Qin@ICLR2025 • 1 year ago

Sure! It will be soon.

Oli • 1 year ago

Nice, really excited to try it. Great work!

Ajay Sreeram • 1 year ago

I was trying the 1.5 7B model; it always tries to click a few pixels off, diagonally above the target. Do we need to pass the screen size somewhere from the desktop app?
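
One possible cause of an offset like this, offered as an assumption rather than a confirmed diagnosis for UI-TARS: if the model sees a resized screenshot and reports click coordinates in that resized image's pixel space, the clicks drift unless they are mapped back to the real screen resolution. A minimal sketch of that mapping follows; `to_screen_coords` and both size parameters are hypothetical names, not part of the UI-TARS desktop app or any published API.

```python
def to_screen_coords(x, y, model_size, screen_size):
    """Map a point from the model's input-image pixel space to screen pixels.

    model_size  -- (width, height) of the image the model actually saw (assumed known)
    screen_size -- (width, height) of the real screen in pixels
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    # Scale each axis independently, since the resize may not preserve aspect ratio.
    return round(x * screen_w / model_w), round(y * screen_h / model_h)


# Example: the model saw a 1280x720 image of a 2560x1440 screen.
print(to_screen_coords(640, 360, (1280, 720), (2560, 1440)))
```

If the offset is instead constant regardless of where the click lands, display scaling (HiDPI) or a window title-bar offset would be a more likely suspect than image resizing.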

chadhietala • 1 year ago

Can you give details about deployment on vLLM? It seems like the model requires a minimum version of it.

☼░▒▅ • 1 year ago

Any plans to open source the full model?

Yujia Qin@ICLR2025 • 1 year ago

Soon there will be~

☼░▒▅ • 1 year ago

🥹

Rainmaker • 2 years ago

Here I share an XGBoost model that delivers a 25% CAGR with minimal drawdown on Visa stock. In this free Substack post I share code and commentary for a powerful Machine Learning strategy that delivers powerful returns.
