How well do today’s frontier models handle long-horizon, multi-step web agent tasks, such as identifying the top 25 U.S. CS PhD programs with ML/AI faculty likely accepting students and compiling the results into a structured sheet? Check out our new work on Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks. Paper: Leaderboard: We introduce Odysseys, a benchmark of 200 long-horizon tasks derived from real browsing sessions and evaluated on the live Internet. We show that binary pass/fail is inadequate in this setting and propose rubric-based evaluation, which better aligns with human judgment and provides more informative signals. Across leading models, the best achieves only 44.5% success, leaving substantial headroom. We further introduce a Trajectory Efficiency metric (rubric score per step) and find efficiency remains extremely low (1.15%), highlighting a key bottleneck. Odysseys provides a realistic benchmark for measuring progress toward web agents capable of sustained, efficient, real-world operation. See a more detailed thread by Jing Yu Koh.

Russ Salakhutdinov

112,224 subscribers

22,518 views • 1 month ago • via X (Twitter)
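
The Odysseys post above describes Trajectory Efficiency only as "rubric score per step". The following is a minimal Python sketch of how such a metric could be computed, not the paper's implementation. It assumes the rubric score is the fraction of rubric items satisfied and that binary success means every item is satisfied; the `Trajectory` class, its fields, and the step count in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical record of one agent run on one task."""
    rubric_passed: int  # rubric items judged as satisfied
    rubric_total: int   # rubric items defined for the task
    steps: int          # number of agent actions taken


def rubric_score(t: Trajectory) -> float:
    # Assumption: every rubric item carries equal weight.
    if t.rubric_total <= 0:
        raise ValueError("a task needs at least one rubric item")
    return t.rubric_passed / t.rubric_total


def is_success(t: Trajectory) -> bool:
    # Assumption: binary success means all rubric items are satisfied.
    return t.rubric_passed == t.rubric_total


def trajectory_efficiency(t: Trajectory) -> float:
    # "Rubric score per step", as described in the post.
    if t.steps <= 0:
        raise ValueError("a trajectory needs at least one step")
    return rubric_score(t) / t.steps
```

For instance, a run that satisfies all 7 of a task's rubric items in a hypothetical 100 steps would get a rubric score of 1.0 and an efficiency of 0.01 under these assumptions. A run that satisfies 3 of 4 items still earns partial credit (0.75) even though it counts as a binary failure, which is the extra signal rubric-based scoring is meant to provide.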

Related videos

🚀We are excited to introduce the Tool Decathlon (Toolathlon), a benchmark for language agents on diverse, complex, and realistic tool use. ⭐️32 applications and 600+ tools based on real-world software environments ⭐️Execution-based, reliable evaluation ⭐️Realistic, covering daily and professional scenarios Toolathlon reveals significant shortcomings of SOTA LLMs in realistic tool-use tasks, where Claude Sonnet 4.5 achieves 38.6% success rate. It also indicates a clear gap between open-source and leading proprietary models. Check our blog: Github: Paper: 🧵⬇️

Junxian He

43,652 views • 7 months ago

LLM agents have demonstrated promise in their ability to automate computer tasks, but face challenges with multi-step reasoning and planning. Towards addressing this, we propose an inference-time tree search algorithm for LLM agents to explicitly perform exploration and multi-step planning in interactive web environments. It is the first tree search algorithm for LLM agents that shows effectiveness on realistic and complex web environments: on the challenging VisualWebArena benchmark, applying our search algorithm on top of a GPT-4o agent yields a 39.7% relative increase in success rate compared to the same baseline without search, setting a state-of-the-art success rate of 26.4%. On WebArena, search also yields a 28.0% relative improvement over a baseline agent, setting a competitive success rate of 19.2%.

Jing Yu Koh

124,857 views • 2 years ago
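
The tree search post above reports relative improvements alongside final success rates, so the baseline rates can be back-calculated. The short Python sketch below does that arithmetic; the function name is hypothetical, and because the published figures are rounded, the implied baselines are approximate rather than values stated in the post.

```python
def implied_baseline(final_rate: float, relative_gain: float) -> float:
    """Return the baseline b such that b * (1 + relative_gain) == final_rate."""
    return final_rate / (1.0 + relative_gain)


# VisualWebArena: 26.4% success after a 39.7% relative increase.
vwa_baseline = implied_baseline(26.4, 0.397)

# WebArena: 19.2% success after a 28.0% relative improvement.
wa_baseline = implied_baseline(19.2, 0.280)

print(f"VisualWebArena baseline: {vwa_baseline:.1f}%")  # 18.9%
print(f"WebArena baseline: {wa_baseline:.1f}%")         # 15.0%
```

In absolute terms, this corresponds to gains of roughly 7.5 percentage points on VisualWebArena and 4.2 points on WebArena.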

Need help in the real world? RoboVQA can guide robots and humans through long-horizon tasks on a phone via Google Meet. We release a dataset of 800k (video, question/answer) with robots & humans doing various long-horizon tasks. Data: Google DeepMind

Pierre Sermanet

64,369 views • 2 years ago

Super excited to share 🧠MLGym 🦾 – the first Gym environment for AI Research Agents 🤖🔬 We introduce MLGym and MLGym-Bench, a new framework and benchmark for evaluating and developing LLM agents on AI research tasks. The key contributions of our work are: 🕹️ Enables the exploration of different training algorithms for AI Research Agents such as RL 🛠️ Provides a flexible evaluation framework that can accommodate different artifacts such as models, algorithms, or predictions 🤖 Allows researchers to evaluate any model without the need to develop a custom agentic harness 🎯 Introduces 13 diverse open-ended AI Research tasks for evaluating AI Research Agents on a wide range of domains such as computer vision, natural language processing, reinforcement learning, game theory, and logical reasoning. 📈 Proposes a new evaluation metric for AI Research Agents MLGym makes it easy to: 1) Add new tasks 2) Evaluate new models 3) Integrate new agents Check out a video of the MLGym Agent to see how it performs the full pipeline of idea generation💡, implementation 👩‍💻, experimentation 👩‍🔬, and iteration 🔄 to improve on ML tasks. Huge thanks to the exceptionally talented Deepak Nathani who led this work and to all the other amazing collaborators who made this possible 🙏🫶🚀

Roberta Raileanu

104,964 views • 1 year ago

INTELLIGENT TASKS ARE A STEPPING STONE TO AGI Today, we are launching ChatLLM Tasks. We have hooked up tools like web search, email, and web scrapers to mini-agents that can be triggered on a schedule. These tasks combine intelligent tool use with crons! We think this is the first step towards AGI. The next step is to connect our AI engineer to create more complex tasks. AGI STEP ONE - DONE!

Bindu Reddy

18,486 views • 1 year ago

Computer use agents are slow and brittle. The fix isn’t just stronger models, but also deploying them as multi-agent systems. MACU is a general Multi-Agent Computer Use framework that consistently lifts success rates by 3.4-25.5% and is up to 1.5x faster on long-horizon tasks.🧵

Jing Yu Koh

27,484 views • 19 days ago

JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models paper page: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the number of open-world tasks could potentially be infinite and lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. The plans will be ultimately dispatched to the goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. In our experiments, JARVIS-1 exhibits nearly perfect performances across over 200 varying tasks from the Minecraft Universe Benchmark, ranging from entry to intermediate levels. JARVIS-1 has achieved a completion rate of 12.5% in the long-horizon diamond pickaxe task. This represents a significant increase up to 5 times compared to previous records. Furthermore, we show that JARVIS-1 is able to self-improve following a life-long learning paradigm thanks to multimodal memory, sparking a more general intelligence and improved autonomy.

AK

141,416 views • 2 years ago

π-0.5 from Physical Intelligence: a model doing long-horizon tasks in real, unseen homes with unseen objects! One of my favorites is below at 10x. Sharing a few of my favorite results in thread, with many more detailed ablations in the paper:

Brian Ichter

11,913 views • 1 year ago

Rubric AI (YC W26) converts expert judgment into training signals for AI models and agents. The human + computational layer for AI reasoning and verification. Congrats on the launch, Pragya Saboo!

Y Combinator

16,157 views • 3 months ago

Introducing Scalable Option Learning (SOL☀️), a blazingly fast hierarchical RL algorithm that makes progress on long-horizon tasks and demonstrates positive scaling trends on the largely unsolved NetHack benchmark, when trained for 30 billion samples. Details, paper and code in >

Mikael Henaff

20,957 views • 8 months ago

Sharing our latest short course: Building and Evaluating Data Agents, created in collaboration with Snowflake and taught by Anupam Datta and Josh Reini. A data agent extracts data from sources such as files or databases, analyzes it, provides insights, and visualizes its findings. But most data agents struggle with reliability or can't handle multi-step reasoning. In this course, you'll learn to build, trace, and evaluate a multi-agent workflow that plans tasks, pulls context from structured and unstructured data, performs web search, and summarizes or visualizes the final results. Learn more and enroll for free!

DeepLearning.AI

40,745 views • 9 months ago

How capable are web agents at solving knowledge work tasks? 🤔 Are LLMs up to the challenge? 🤖 Introducing WorkArena: a benchmark where agents meet the wild web of enterprise software 🌐🖥️ Paper: Website: 🧵 1/7

Alexandre Lacoste

24,504 views • 2 years ago

One of the things I’m most excited about this year is building agents that can work productively for hours, days, or weeks. Coding agents are starting to become very competent at this, but what about computer use agents? Our new benchmark, Odysseys (co-led with Lawrence Jang) is a set of 200 new tasks derived from real world browsing behavior that measure long horizon web navigation capabilities (potentially up to hours of web browsing work). Interestingly, we find that frontier CUAs are already surprisingly good at working productively for up to an hour on these tasks, but there’s a lot of work to be done in making them even more efficient. Like every other AI researcher, my real dream is to open a cafe once we solve ASI. So, here’s Opus 4.6 doing some market research for me ("I want to do market research on the most popular cafes in Singapore. Analyse the menus of the top 10 cafes in Singapore (by Google reviews/ratings), and make sure we include at least 1 from the North/South/East/West/Central regions of Singapore. Keep the relevant pages of each cafe open, and summarise their pricing, menu offerings, unique selling points, making sure to reference which tab is opened for each cafe. For each cafe, also help me figure out how long it would take to get to it from Tampines MRT, and include this in your final summary."). I was very impressed to see Opus 4.6 complete this task after working for 52 mins, satisfying all 7 rubrics that corresponded to this task. It provided a very nice markdown summary at the end that gave me all the information I asked for!

Jing Yu Koh

49,333 views • 1 month ago

📢 Announcing one of the most exciting works from us this year on scalable robot policy evaluation through real-to-sim transfer, moving toward a scalable evaluation engine with structured world models that capture the appearance, geometry, and dynamics of environments involving deformable objects. 🤖 Evaluation remains one of the biggest bottlenecks in building general-purpose robots. Today, robots are still evaluated only in the real world, which is orders of magnitude slower than the development of language agents. We propose a new framework where simulation performance strongly correlates with the real world (r > 0.9), even for deformable objects. The key difference from existing work lies in the correlation between simulation and reality: if a robot model performs better in the digital world, does it also perform better in the real world? This question has long made people hesitant about simulation-based evaluation — especially for deformable objects. We are changing that. Our pipeline achieves effective real-to-sim transfer, establishing state-of-the-art correlation between simulation and reality for deformable object manipulation. It provides a scalable and reproducible evaluation engine for robot learning. 🌐

Yunzhu Li

39,850 views • 7 months ago

Introducing 🦀 CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents 🦀 CRAB provides an end-to-end and easy-to-use framework to build multimodal agents, operate environments, and create benchmarks to evaluate them, featuring three key components: - 🔀 Cross-environment support - agents can operate tasks in 📱 Android and 💻 Ubuntu. - 🕸️ Graph evaluator - provides a fine-grain evaluation metric for agents. - 🤖 Task generation - composes subtasks to automatically generate tasks. By connecting all devices to agents, 🦀CRAB unlocks greater capabilities for human-like tasks than ever before. Use 🦀 CRAB to benchmark your multimodal agents! - 👨‍💻 Check out the repository: - 📝 Read the paper: - 🌐 Find out more via the project page: - 🐫 Join our community:

CAMEL-AI.org

72,661 views • 1 year ago

Google has launched Med-Gemini, an advanced AI fine-tuned for medical tasks. It significantly outperforms earlier models, including GPT-4, on most medical benchmarks. It achieves top scores, particularly on the MedQA-USMLE benchmark with a groundbreaking 91.1% accuracy.🚀 It demonstrates superior performance over GPT-4 by 44.5% on average across seven multimodal benchmarks. It excels in tasks such as medical summarization, generating doctor referrals, and simplifying medical documents. It is a preferred method over human expert analyses for complex text-based medical tasks. This marks a significant advancement in AI for healthcare, suggesting potential improvements in medical diagnostics and patient care. By choosing to harmonize with technology, we become more human, we become IRREPLACEABLE. Join the IRREPLACEABLE Academy and read the Book: #techforgood #medicine #ai

Pascal Bornet

14,393 views • 1 year ago

Long-horizon visual goals remain surprisingly hard for robot manipulation. We introduce Act2Goal, a goal-conditioned policy that uses a visual world model to reason about progress toward a goal, and practice it autonomously in the real world.

Jianlan Luo

95,100 views • 5 months ago

Learn how to build Gemini 3 Deep Agents with this video from LangChain. These agents can reason through complex, long-horizon tasks by: > Breaking goals into actionable steps > Delegating specific work for specialized models > Leveraging file systems and code execution

Google AI Developers

19,484 views • 6 months ago

We’re redefining what’s possible with AI. With the release of our latest model, Command A, optimized for real-world agentic and multilingual tasks, we’re demonstrating our commitment to bringing enterprises AI that goes beyond the ordinary, and offers security & efficiency. Our team has developed highly capable and efficient models that can be run on just 2 GPUs. Check out our tech report to learn more:

Cohere

16,413 views • 1 year ago

Today we're thrilled to introduce Jace, your AI employee. Jace goes beyond AI chatbots by being able to handle longer-running tasks and taking actions in the digital world. By using our new AWA-1 (Autonomous Web Agent) model, Jace can use a browser to interact with websites just like any human would. This allows it to handle real-world tasks like researching and booking flights, handling a hiring process, or even setting up a company. We can't wait to see how Jace will help you! Join us at See some examples of what Jace can do in this thread:

Viktor

198,736 views • 2 years ago