Introducing UI-TARS-1.5, a vision-language model that beats OpenAI Operator and Claude 3.7 on GUI Agent and Game Agent tasks. We've open-sourced a small version of the model for research purposes; more details can be found in our blog. TARS learns solely from a screen, but generalizes beyond a screen! Blog: Model: App:

Yujia Qin

5,618 subscribers

85,137 views • 1 year ago • via X (Twitter)

Gaming, Science & Technology, Education

22 Comments

Yujia Qin@ICLR2025 • 1 year ago

UI-TARS-1.5 achieves SOTA results on several GUI benchmarks, e.g., OSWorld, Windows Agent Arena, Online-Mind2Web, AndroidWorld, and ScreenSpot-Pro. These results demonstrate UI-TARS's superiority on computer use, browser use, and phone use. Also, with the GUI Tool, UI-TARS almost matches GPT-4o with the search API.

Yujia Qin@ICLR2025 • 1 year ago

Here's a demo from UI-TARS on GUI tasks~

Yujia Qin@ICLR2025 • 1 year ago

To further assess UI-TARS-1.5 in complex, open-ended environments, we tested it on Minecraft—a popular sandbox game well-suited for evaluating embodied intelligence. Unlike static GUI benchmarks, Minecraft requires real-time decision-making in a dynamic 3D space using visual input and low-level controls (mouse and keyboard), closely reflecting real-world computer use.

Yujia Qin@ICLR2025 • 1 year ago

TARS has amazing inference-time scaling ability. With more interaction rounds, TARS achieves far better performance on GUI tasks and game tasks. The scaling curve surpasses both OpenAI CUA and Claude 3.7. We even observe performance gains beyond 1,000 interaction steps.

Yujia Qin@ICLR2025 • 1 year ago

Gameplay represents a critical frontier for multimodal agents, serving as an ideal testing ground for evaluating complex reasoning, decision-making, and adaptability. Games demand intuitive, common-sense reasoning and strategic foresight, making them perfect benchmarks to test and showcase the advanced cognitive capabilities of multimodal agents. To evaluate UI-TARS-1.5's gameplay proficiency, we selected 14 diverse games from Each model was allowed up to 1,000 interaction steps per game to generate execution traces, repeated across multiple runs.
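The evaluation protocol described above (a fixed budget of interaction steps per game, repeated over several runs) can be sketched roughly as follows. This is only an illustration and not the authors' harness: `StepFn`, `run_episode`, and `success_rate` are hypothetical names, and the agent is reduced to a callback that reports whether the task is solved at a given step.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for "agent acts once in the game": takes the step
# index and returns True once the task is solved.
StepFn = Callable[[int], bool]


def run_episode(step_fn: StepFn, max_steps: int = 1000) -> Tuple[bool, int]:
    """Run one episode under a step budget; return (solved, steps_used)."""
    for step in range(1, max_steps + 1):
        if step_fn(step):
            return True, step
    return False, max_steps


def success_rate(make_step_fn: Callable[[], StepFn],
                 runs: int = 3,
                 max_steps: int = 1000) -> float:
    """Average success over several independent runs (fresh agent each run)."""
    wins = sum(run_episode(make_step_fn(), max_steps)[0] for _ in range(runs))
    return wins / runs
```

With a larger `max_steps`, an agent that needs many rounds to finish gets the chance to succeed, which is the sense in which performance can scale with the interaction budget.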

Yujia Qin@ICLR2025 • 1 year ago

Explore more interesting showcases of UI-TARS on

Chris Barber • 1 year ago

42% on OSWorld is impressive!

Yujia Qin@ICLR2025 • 1 year ago

Thanks! It will be higher soon!

orange.ai • 1 year ago

Impressive!

Cua • 1 year ago

soon as an agent loop in c/ua 👀

Yujia Qin@ICLR2025 • 1 year ago

Sure it will be!

yanghan • 1 year ago

nice work

Petr Glaser • 1 year ago

How well can it play Pokemon? 🤔

Oli • 1 year ago

Looks really cool, but when can we access the larger 1.5, and will it be open source too?

Yujia Qin@ICLR2025 • 1 year ago

Sure! Soon will be

Oli • 1 year ago

nice really excited to try it great work

Ajay Sreeram • 1 year ago

I was trying 1.5 7B; it always tries to click a few pixels above, diagonally. Do we need to pass the screen size somewhere from the desktop app?
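One possible cause, which this thread does not confirm, is that the model predicts coordinates in the pixel space of the screenshot it was given (which may have been resized) rather than the physical screen. Under that assumption, a predicted point would need rescaling before the click is issued. The sketch below shows that rescaling only; `to_screen_coords` and its parameters are hypothetical names, not part of the UI-TARS app.

```python
from typing import Tuple


def to_screen_coords(x: float, y: float,
                     model_size: Tuple[int, int],
                     screen_size: Tuple[int, int]) -> Tuple[int, int]:
    """Map a point from model-image pixels to physical screen pixels.

    model_size:  (width, height) of the image the model actually saw (assumed).
    screen_size: (width, height) of the real display.
    """
    model_w, model_h = model_size
    screen_w, screen_h = screen_size
    # Scale each axis independently by the ratio of screen to model resolution.
    return round(x * screen_w / model_w), round(y * screen_h / model_h)
```

For example, a prediction at the centre of a 1000×500 input maps to the centre of a 1920×1080 display. If the two sizes differ and no rescaling is applied, clicks land at a consistent offset from the target.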

chadhietala • 1 year ago

Can you give details about deployment on vLLM? It seems like the model requires a min-version of it.

☼░▒▅ • 1 year ago

plans to open source the full model?

Yujia Qin@ICLR2025 • 1 year ago

Soon there will be~

☼░▒▅ • 1 year ago

🥹

Related Videos

New LLMs that control UIs! ByteDance Research releases UI-TARS, fine-tuned GUI agent that integrates reasoning, and action capabilities into a single vision-language model. Think of computer use but open. 👀 TL;DR; 3️⃣ Available in 3 sizes: 2B, 7B, and 72B parameters 🧠 Trained Qwen2-VL models with SFT & DPO 🥇 72B version achieves 82.8% on VisualWebBench (beating GPT-4 and Claude) 🏆 Achieves state-of-the-art results on 10+ GUI agent benchmarks 💡 Reasons before taking an action 🧑🏻‍💻 Can Click, Long Press, type, scroll, open app, navigate back/home, wait 🤗 Released under Apache 2.0 on Hugging Face

Philipp Schmid

48,157 views • 1 year ago

Introducing Meta Perception Language Model (PLM): an open & reproducible vision-language model tackling challenging visual tasks. Learn more about how PLM can help the open source community build more capable computer vision systems. Read the research paper, and download the code and dataset:

AI at Meta

94,330 views • 1 year ago

Sub-agent Model Selection — Different Tasks, Different Models Your main agent runs Qwen3.6-Plus for quality. But not every subtask needs a flagship model. Now sub-agents can use a different model. Create a skill file with model: openai:qwen3.5-plus and the sub-agent runs on that model. Powerful model for the hard parts, fast model for the easy parts. Save tokens without sacrificing quality on what matters.

Qwen

21,333 views • 2 months ago

A research preview of Operator, an agent that can use its own browser to perform tasks for you.

OpenAI

3,936,394 views • 1 year ago

🏆 Introducing TARS 1/ Your AI Executive Assistant that Turns Natural Language into Action. -> With TARS, you can seamlessly orchestrate multiple AI agents to gather insights, automate tasks, and take actions across platforms — all in one place. -> Build upon the 1st productized multi-agent orchestration (MAO) ecosystem designed for AI Agentic workflow. 🏠 Explore more:

Questflow

85,907 views • 1 year ago

We are excited to share a research preview of our generative agent. The agent is being trained to solve the hardest tasks in 3D and beyond, using only keyboard and mouse actions. Join the waitlist: Our agent app runs on Windows or Mac, either locally or with one-click setup for a Windows VM. It’s still early days, but this paves the way for production-level workflows for the first time ever. Blog:

Common Sense Machines

155,199 views • 1 year ago

The teams shipping AI agents right now are bleeding money on the dumbest possible expense: teaching a 400B-parameter model to read a file name. Every time an AI agent needs to "see" something today, it routes an image through a frontier model. OCR, object detection, checking if a button exists on screen. You're paying GPT-4o or Claude pricing for tasks that require perception, not reasoning. One agent workflow processing a few thousand screenshots per day can burn through more on vision calls than on the actual thinking. Perceptron's Isaac is 2B parameters. Built by the team that created Meta's Chameleon multimodal models. On perceptive benchmarks, it matches or beats models 50x its size. The VQA, OCR, and object detection scores are competitive with models running on infrastructure that costs orders of magnitude more. The MCP wrapper is the distribution play. One install command and every Claude Code agent can offload vision tasks to a model that runs on a single consumer GPU. The agent keeps its reasoning in the frontier model and routes perception to a specialist. That split is how you get vision-heavy agent workflows from "technically possible but expensive" to "cheap enough to run on everything." This is the same pattern that won in every other compute-intensive stack. General-purpose handles orchestration. Specialists handle the heavy lifting. Graphics went through it. Audio went through it. Video encoding went through it. Vision in AI agents is next. The teams building agents that see 10,000 images a day will care about this before anyone else does.

Aakash Gupta

55,978 views • 2 months ago

Apple just released and open-sourced FastVLM! FastVLM is a lightning-fast vision-language model that combines rapid image and text understanding with efficient on-device performance. 100% Open Source

Sumanth

43,693 views • 9 months ago

🚀 Meta FAIR is releasing several new research artifacts on our road to advanced machine intelligence (AMI). These latest advancements are transforming our understanding of perception. 1️⃣ Meta Perception Encoder: A large-scale vision encoder that excels across several image & video tasks. 2️⃣ Meta Perception Language Model: A fully open & reproducible vision-language model designed to tackle visual recognition tasks. 3️⃣ Meta Locate 3D: An end-to-end model for accurate object localization in 3D environments. 4️⃣ Releasing model weights for our 8B-parameter Dynamic Byte Latent Transformer, an alternative to traditional tokenization methods with the potential to redefine the standards for language model efficiency and reliability. 5️⃣Collaborative Reasoner: A framework for evaluating & improving collaborative reasoning skills in language models. Download the code, datasets, and research papers and learn more about how these artifacts are paving the way for more efficient and accurate AI systems.➡️

AI at Meta

163,214 views • 1 year ago

📈 Today, we're launching Agent Mode in Excel on Windows and Mac. An Excel Copilot that can work with you and make edits in your spreadsheet like an expert collaborator. We're also introducing a new, multi-model system that supports both GPT models from OpenAI and Claude models from Anthropic. Wanna try Claude Opus 4.5 natively in Microsoft Excel? We've got you. Give it a shot and let me know what you think.

Trevor O'Brien

74,821 views • 5 months ago

1. Meta’s open-sourced multisensory model Meta is back (again!) with yet another exciting open-source project. Introducing ImageBind, a new AI research model that understands and combines text, audio, visual, movement, thermal, AND depth data.

Rowan Cheung

173,984 views • 3 years ago

How I generated a high quality 3D model in Blender, from a simple 2D image, using Claude 3.7! Screen capture + link + step-by-step process below👇

Emm | scenario.com

169,478 views • 1 year ago

🔥We're thrilled to announce: ShowUI Local Run!🔥 🧑‍💻Now, you can use our 2B vision-language-action model for Local Computer control! 💰30x Cheaper than Claude! 🔗Model: 🔗Computer Use OOTB: #ComputerUse #Agent #Claude

Kevin Lin

14,539 views • 1 year ago

🤯ByteDance just Open Sourced UI-TARS - 2 SOTA models (7B & 72B) + a PC/MacOS app to control your computer with vLMS And they are not messing around, beating GPT-4o and Claude, SOTA across 10 benchmarks Will you be installing this on your pc?

Alex Volkov

69,738 views • 1 year ago

Introducing RAGs, a Streamlit app that allows you to create and customize your own RAG agent and then use it over your own data, all with natural language 🔥 Directly inspired by OpenAI GPTs, you can converse with an agent to help you do search/retrieval over any data you specify. The app contains three main pages: 🏠 Home Page : Have a “builder agent” build your RAG agent through natural language (you specify the data). ⚙️ RAG Config: Look at configured parameters 🤖 Use your RAG agent! Check out details below 👇 Blog: Repo:

LlamaIndex 🦙

475,732 views • 2 years ago

This approach has made Sonnet the model of choice for developers worldwide. In addition to our new model, we're launching Claude Code, our first coding tool, in a limited research preview. With Claude Code, you can delegate substantial tasks to Claude—right from your terminal.

Anthropic

1,140,188 views • 1 year ago

You can now automate any task on your phone by letting AI control it AutoGLM from Zai is a 100% open source vision-language model that: - Understands what's on your screen - Acts autonomously from a prompt - Totally private (works LOCALLY) Tutorial ↓

Paul Couvert

169,735 views • 6 months ago

How can agents understand the world from diverse language? 🌎 Excited to introduce Dynalang, an agent that learns to understand language by 𝙢𝙖𝙠𝙞𝙣𝙜 𝙥𝙧𝙚𝙙𝙞𝙘𝙩𝙞𝙤𝙣𝙨 𝙖𝙗𝙤𝙪𝙩 𝙩𝙝𝙚 𝙛𝙪𝙩𝙪𝙧𝙚 with a multimodal world model!

Jessy Lin

107,491 views • 2 years ago

We’ve been working with OpenAI for the past few weeks to test their latest Computer-using Agent model. On our evals, CUA has set a new SOTA. Once integrated with an agent, it can complete long-horizon tasks previously impossible. Try CUA on our playground and Act SDK for free!

Scrapybara

89,458 views • 1 year ago