Microsoft presents Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale. Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on the order of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

AK

502,458 subscribers

19,684 views • 1 year ago • via X (Twitter)

Related videos

Stop spending hours on manual work. You can now use a multi-agent AI workforce to get more work done in less time. Here's how 👇

Try Eigent AI
- Lets you build and run a custom AI workforce on your desktop.
- Automate complex workflows using multi-agent task execution.
- Built on CAMEL-AI’s top open-source projects (CAMEL-AI.org & OWL).
- Boost productivity with deep customization and strong privacy.

Features:
- Customize Your AI Workforce: Build task-specific agents with domain skills and tools.
- Faster Execution: Eigent runs agents in parallel to automate complex workflows.
- Human-in-the-loop: Automatically asks for help when tasks hit uncertainty.

What sets Eigent apart?
- 3–5× faster task execution using a parallel multi-agent workforce.
- Modular design lets you add new capabilities without changing the core system.
- Self-optimizing agents that replan and adapt during execution for higher success.
- Deploy anywhere: cloud, local, or enterprise, with full open-source flexibility.

Try building your multi-agent AI workforce here:
Join their community to build your multi-agent workforce:
Check their GitHub:

Shushant Lakhyani

20,423 views • 11 months ago

We are investing in the frontiers of agentic capabilities with a few early prototypes. Project Mariner is built with Gemini 2.0 and is able to understand and reason across information - pixels, text, code, images + forms - on your browser screen, and then uses that info to complete tasks for you. When evaluated against the WebVoyager benchmark, it achieved a state-of-the-art result of 83.5% working as a single agent setup.

Sundar Pichai

219,170 views • 1 year ago

Multi-robot learning is getting a serious boost! 📚 Researchers have extended Isaac Lab to train heterogeneous multi-agent robotic policies at scale. The new framework supports high-resolution physics, GPU-accelerated simulation, and both homogeneous and heterogeneous agents working together on coordination tasks. They benchmarked different approaches (MAPPO: Multi-Agent Proximal Policy Optimization and HAPPO: Heterogeneous Agent PPO) across six challenging scenarios and showed that large-scale multi-robot training is not only feasible, but efficient. It’s an important step for real-world robotic collaboration, where teams of robots need to coordinate, split tasks, adapt roles, and interact dynamically, not just operate as identical clones. The code is open-source, and it pushes Isaac Lab closer to what robotics actually needs: scalable, physics-driven environments where many different robots can learn to work together. Here's the project page: ♻️ Join the weekly robotics newsletter, and never miss any news →

Lukas Ziegler

38,997 views • 7 months ago

Frameworks such as ai16zdao's Eliza and Virtuals Protocol have been instrumental in early AI agent developments. Agent swarms working in hierarchy represent for many the next logical step in unlocking the vast potential of AI. Learn below how Shadō Network achieves this. AI agents launched through current popular platforms have individual personas, on-chain functions and access to data via various APIs. That said, they operate in isolated environments, with a ceiling on emergent behaviour such as collaboration or competition. Shadō Network invites massive expansion of the capabilities of both new and existing AI agents, with an open-source package, easily integrated into popular frameworks, that enables the launching of stratified agent swarms. Our website is live: The "Shadō Play" package provides a modular, configurable platform for creating or employing agents of choice in a swarm-like setup, opening a Pandora’s box of near-infinite emergent agent behaviours, relationships and functionalities. Users will be able to make use of various prefab client integrations such as Twitter, Telegram, Ollama, and others to tailor swarms to their needs or create their own extensions to enhance agent capabilities even further. Agents operate with a memory module and an HTN for autonomously deciding which interactions to act on, walking the line between autonomy and configurability. The Shadō Network project’s development is supported by our ghostly friend Omnipotent (👻,👻), an AI agent developed by the Shadō Network team, trained on and fine-tuned with a multitude of academic data related to artificial intelligence, blockchain, finance, software engineering, world building and more. Omnipotent serves as both an interactive steward for the project and as an asset: regularly scanning social platforms, websites and newsfeeds, he is capable of providing the team with project development advice, whilst also communicating with the wider world via his automated X account (launching soon).
Shadō Network is collaborative and open-sourced. Agentic swarms require a developer swarm to maximize the technical capabilities and impact the greatest number of users. Our dedicated team of core contributors is active in other web3 AI repos and is here to guide project direction and foster growth. We’re facilitators, not gatekeepers... Alone we can go fast but together we can go far. A lot more to come soon. 👻

Shadō Network | シャドウネットワーク

23,546 views • 1 year ago

OpenAI's AgentKit will be so insane, build every step of agents on one platform. These visual agent builders make the whole process of iterating and launching agents far more efficient. It sits on top of the Responses API and unifies the tools that were previously scattered across SDKs and custom orchestration. It lets developers create agent workflows visually, connect data sources securely, and measure performance automatically without coding every layer by hand. The core of AgentKit is the Agent Builder, a drag-and-drop canvas where each node represents an action, guardrail, or decision branch. Developers can link these nodes into multi-agent workflows, preview results instantly, and version each setup. It supports inline evaluation so that developers can see how changes affect output before deploying. The Connector Registry is a single admin panel that manages how data and tools connect across the OpenAI ecosystem. It centralizes integrations like Google Drive, SharePoint, Dropbox, and Microsoft Teams. Large organizations can govern access and flow of data between agents securely under one global console. ChatKit provides a ready-to-use chat interface for embedding agents inside apps or websites. It manages streaming, message threads, and model reasoning displays automatically. Developers can skin the interface to match their product without writing custom front-end code. Under the hood, all these blocks use the same execution core that runs agent reasoning through OpenAI’s APIs. Workflows in Agent Builder compile down to structured instructions for the Responses API, which handles model calls, tool use, and context passing. Connector Registry handles authentication and routing for external tools, while Evals and RFT provide feedback loops that improve agents over time. This integration means developers no longer need to handle orchestration logic, model evaluation pipelines, or safety layers separately. 
Everything runs natively within OpenAI’s control plane with managed security, automatic versioning, and built-in testing. In short, AgentKit standardizes the entire life cycle of an AI agent—from visual design to deployment and performance tuning—inside a single unified system.

Rohan Paul

178,460 views • 9 months ago

Codex update 0.105.0 is out! Despite the fairly pedestrian changelog, this one's a doozy. It's a laundry list of quality-of-life improvements across the board:
- Wispr Voice dictation (hold space to talk)
- Theme picker
- Codex can prevent sleep on Linux & Windows (I just know there's a joke in there)
- Customize Plan Mode reasoning
- Many other fixes/updates

There's also a complete overhaul to subagents:
- New names for readability
- Visual display overhaul (way cleaner)
- Allow for multi-layered subagent depth (max_depth)
- Custom multi-agent role definitions (custom subagents)
- /agents now shows agent names, agent roles, and "dead agents" for auditability

This is the largest single update of Codex I've ever seen! Absolutely massive if you love to use multi-agents.

To turn on Voice Transcription, enable:
[features]
voice_transcription = true

Does not work on Linux yet. Well done OpenAI Developers 👏

am.will

115,392 views • 4 months ago

LangGraph. CrewAI. Agno. Which one to pick? The good news is that this will not matter soon! Finally, we have a full picture of how the industry is solving this with just three open protocols that work across ALL frameworks. It's not about picking the best framework. Instead, it's about understanding how protocols create interoperability.

The Agent Protocol Landscape shows how three complementary protocols are creating a universal language for Agents:

> AG-UI (Agent-User Interaction):
- The bi-directional connection between agentic backends and frontends.
- This is how agents become truly interactive inside your apps, not just as chatbots, but collaborative co-workers.

> MCP (Model Context Protocol):
- The standard for how agents connect to tools, data, and workflows.

> A2A (Agent-to-Agent):
- The protocol for multi-agent coordination.
- How agents delegate tasks and share intent across systems.

These aren't competing standards. They're layers of the same stack and have handshakes with each other. So instead of building point-to-point integrations, you build to protocols. Moreover, you can integrate LangGraph, CrewAI, or Agno into the same frontend, without rewriting your UI logic. These protocols let everything work together. For instance:
- Your LangGraph agent pulls data via MCP.
- It delegates analysis to a CrewAI agent via A2A.
- Results stream to your React app via AG-UI.
- Users see real-time collaboration in your interface.

This way, you can focus on building agent capabilities instead of integration mechanics. The protocols handle interoperability automatically. CopilotKit unifies this entire stack into one framework so you can build "Cursor for X" style apps without implementing each protocol from scratch. It gives you all three protocols, generative UI support, and production-ready infrastructure in one framework.

I have shared this playbook in the replies! It breaks down handshakes, misconceptions, and real examples and shows exactly how to start building.

Avi Chawla

30,762 views • 8 months ago

OpenAI has introduced the ChatGPT Agent, which handles complex multi-step tasks from research to automation. Genspark goes further in some areas: In addition to user-friendly office tools (Slides, Docs, Sheets, AI Secretary, AI Drive), Genspark scores with dynamic tool orchestration and an intelligent feedback loop, a clear added value, especially for individuals and small teams.

ChatGPT Agent: Offers browser and API access, terminal control and deep search capabilities. Strengths include high security mechanisms, comprehensive user control and integration with productivity tools such as Gmail and Calendar. Ideal for end users and teams who need maximum control and data protection.

Genspark Super Agent: Enables no-code workflows, creates high-quality visual content (slides, videos) and automates entire workflows. With tool calling, the agent automatically selects the best solution from over 80 integrated tools, e.g. for CRM queries, task management or API access. The feedback loop allows the agent to monitor the use of a tool during execution and dynamically switch to another tool or adapt the workflow if necessary. Thanks to this multi-model architecture, Genspark often works more precisely and efficiently in benchmarks than comparable systems.

Chubby♨️

176,267 views • 1 year ago

The entire timeline is filled with talk about Sentient, but I love being as informative and precise as possible on pressing issues. Let’s quickly talk about @SentientAGI’s Recursive Open Meta Agent (ROMA). ROMA is an open-source meta-agent framework used to build high-performance multi-agent systems. ROMA serves as the conductor of a mass choir, or the captain of a ship. The captain gives commands for the subordinates to follow to ensure efficiency on all sides. In like manner, ROMA provides a hierarchical tree system where parent agents break down complex tasks into simpler subtasks that are then passed on to child nodes. A family tree has the parents above, and the same analogy works here, but that’s not all that makes it stand out. The results and solutions produced by the child nodes are aggregated, and the results flow back up to the parent nodes. At the center of it all is ROMA, engineering the process and making sure everything runs smoothly without breaking or failing. Are you really bullish on Sentient and the future of AGIs?

OHJAY ⭕️ || 🇬🇧

23,521 views • 9 months ago
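The parent/child flow described in the ROMA post (break a task into subtasks, hand them to child nodes, aggregate results back up) can be illustrated with a small recursive sketch. This is not ROMA's actual API: `TaskNode`, `plan`, `execute`, `aggregate`, and `solve` are hypothetical names, and the "planner" here just splits a goal on the word "and" where a real system would presumably call a model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskNode:
    """One node in the task tree (hypothetical structure, not ROMA's)."""
    goal: str
    children: list[TaskNode] = field(default_factory=list)
    result: Optional[str] = None


def plan(goal: str) -> list[str]:
    """Toy planner: split a compound goal on ' and '. Returns [] if atomic."""
    parts = [p.strip() for p in goal.split(" and ")]
    return parts if len(parts) > 1 else []


def execute(goal: str) -> str:
    """Toy executor for an atomic task."""
    return f"done({goal})"


def aggregate(goal: str, child_results: list[str]) -> str:
    """Combine child results into the parent's result."""
    return f"{goal} => [" + "; ".join(child_results) + "]"


def solve(node: TaskNode, depth: int = 0, max_depth: int = 3) -> str:
    """Decompose top-down, then pass results back up to the parent."""
    subgoals = plan(node.goal) if depth < max_depth else []
    if not subgoals:
        node.result = execute(node.goal)
        return node.result
    node.children = [TaskNode(g) for g in subgoals]
    child_results = [solve(c, depth + 1, max_depth) for c in node.children]
    node.result = aggregate(node.goal, child_results)
    return node.result


root = TaskNode("book flights and reserve hotel")
print(solve(root))
# book flights and reserve hotel => [done(book flights); done(reserve hotel)]
```

The `max_depth` guard is an assumption added so the recursion always terminates, even if a planner keeps proposing subtasks.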

We’re excited to introduce Text-to-LoRA: a Hypernetwork that generates task-specific LLM adapters (LoRAs) based on a text description of the task. Catch our presentation at #ICML2025! Paper: Code: Biological systems are capable of rapid adaptation, given limited sensory cues. For example, our human visual system can quickly adapt and tune its light sensitivity to our surroundings. While modern LLMs exhibit a wide variety of capabilities and knowledge, they remain rigid when adding task-specific capabilities. Traditionally, customizing these models requires gathering large datasets and performing often expensive, time-consuming fine-tuning for specific applications. To bypass these limitations, Text-to-LoRA (T2L) meta-learns a “hypernetwork” that takes in a text description of a desired task, as a prompt, and generates a task-specific LoRA that performs well on the task. In our experiments, we show that T2L can encode hundreds of existing LoRA adapters. While the compression is lossy, T2L maintains the performance of task-specifically tuned LoRA adapters. We also show that T2L can even generalize to unseen tasks given a natural language description of the tasks. Importantly, Text-to-LoRA is parameter-efficient. It generates LoRAs in a single, inexpensive step, based solely on a simple text description of the task. This approach is a step towards dramatically lowering the technical and computational barriers, allowing non-technical users to specialize foundation models using plain language, rather than needing deep technical expertise or large compute resources.

Sakana AI

403,159 views • 1 year ago
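To make the Text-to-LoRA idea above more concrete, here is a toy sketch of the data flow only: a task description is embedded, a hypernetwork maps that embedding to the two low-rank LoRA factors, and the adapter is added to a frozen weight matrix as W + BA. Everything here is an assumption for illustration: `embed_task` is a deterministic stand-in for a real text encoder, `ToyHyperNetwork` is a single untrained linear layer, and none of the names come from the T2L code release.

```python
from __future__ import annotations

import random


def embed_task(description: str, dim: int = 8) -> list[float]:
    """Stand-in for a text encoder: pseudo-random vector seeded by the text."""
    rng = random.Random(description)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


class ToyHyperNetwork:
    """One linear layer mapping a task embedding to flattened LoRA factors."""

    def __init__(self, embed_dim: int, d_out: int, d_in: int, rank: int, seed: int = 0):
        rng = random.Random(seed)
        self.d_out, self.d_in, self.rank = d_out, d_in, rank
        n_outputs = rank * (d_out + d_in)  # entries of B plus entries of A
        self.weights = [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)]
                        for _ in range(n_outputs)]

    def generate(self, task_embedding: list[float]):
        """Return (A, B) with A: rank x d_in and B: d_out x rank."""
        flat = [sum(w * e for w, e in zip(row, task_embedding))
                for row in self.weights]
        split = self.d_out * self.rank
        b_flat, a_flat = flat[:split], flat[split:]
        B = [b_flat[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        A = [a_flat[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        return A, B


def apply_lora(W, A, B):
    """Adapted weights W' = W + B @ A (the standard LoRA update)."""
    rank = len(A)
    return [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(rank))
             for j in range(len(W[0]))]
            for i in range(len(W))]


hyper = ToyHyperNetwork(embed_dim=8, d_out=4, d_in=6, rank=2)
A, B = hyper.generate(embed_task("solve grade-school math word problems"))
W = [[0.0] * 6 for _ in range(4)]  # frozen base weights (zeros for the demo)
W_adapted = apply_lora(W, A, B)
print(len(B), len(B[0]), len(A), len(A[0]))  # 4 2 2 6
```

In this sketch a new task only needs one forward pass through the hypernetwork, which mirrors the post's "single, inexpensive step" claim; the meta-learning step that would actually train the hypernetwork is omitted.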

The Gemini 2.0 era is here. And we’re excited for you to start building with it. A quick rewind of what we just released ⏪

Gemini 2.0 Flash ⚡ comes with low latency and better performance.
🔵 You can now access an experimental version in Gemini on the web, while Gemini Advanced users can try Deep Research, a new AI research assistant.
🔵 Developers can begin building through the Gemini API in Google AI Studio and Vertex AI.

2.0 is also enabling new research prototypes of AI agents, including:
🔵 Project Astra, which explores future capabilities of a universal AI assistant
🔵 Project Mariner, which shows what’s possible for human-agent interaction, starting with your browser
🔵 Jules, an experimental AI-powered coding agent

Finally, we’re exploring how 2.0 can be used in agents across domains, from navigating the virtual world of video games to applying its spatial reasoning capabilities to robotics. 🤖

Google DeepMind

231,798 views • 1 year ago

Excited to share KDA (Kernel Design Agents), which powers the HAN Lab Kernel Mafia's top-ranking (#1 to #3) kernels at the Kernel Contest 🚀🚀🚀 Thanks to agents, everyone can be a "kernel bro" in 2026: by adapting the KDA, the team ranked #1 in MoE, #2 in DSA, and #3 in GDN in the Pure Agent track at the MLSys FlashInfer Kernel Contest. That is notable given that the main participant (dongyun zou) has written only ~400 LoC of Triton and 0 lines of CUDA in 2026. The core philosophy here is to leverage Humanize (the best harness framework) to let the agent run autonomously for as long as possible. By minimizing human involvement and input, and placing full trust in the agent, we can achieve kernel performance that nears SOTA levels. HAN Lab Mafia Solution to MLSys’26 Kernel Contest: KDA Github:

Ligeng Zhu

109,445 views • 2 months ago

🧃 Introducing stereOS: a Linux-based operating system hardened and purpose-built for AI agents. It's clear that agents need an ACTUAL operating system (not what people are calling an "OS") to witness the full breadth and depth of their capabilities while mitigating the blast radius of autonomous, untrusted actors.

But there are so many problems with AI sandboxes today:
* Going out to the Apple Store and buying a Mac mini will never scale and is way too expensive (obviously)
* Running in Docker is too restrictive (agents can't stand up their own container infrastructure, no sub-virtualization, docker-in-docker is very broken)
* Firecracker strips all the hardware, so GPU PCIe passthrough, secure boot, FIPS, etc. are out of the question.
* Native VMs are too fat and the overhead of 1 agent per VM is too much.

stereOS takes a different approach: it's a full NixOS system that you boot and then kick off agent sandboxes inside with gVisor + /nix/store namespace mounting. Each agent gets their own kernel and the /nix/store is read-only by nature. Even if the agent was somehow able to escape the gVisor virtual kernel, they'd land on the NixOS system as the "agent" user! Not your actual hardware!! If you want to take a defense-in-depth approach, we support "native" agents that run at the system level, kicked off by our `agentd` utility. These agents, on their own, can manage and kick off other sub-agents using the internal sandboxing mechanisms.

Today, we're open sourcing all of this:
* stereOS: our purpose-built Linux OS -
* masterblaster: client utility to launch, manage, and orchestrate agents -
* stereosd: the stereOS system control plane daemon -
* agentd: the stereOS system agent management daemon -

Give it a try, throw us a star, and let me know what you think 🧃⭐️

John McBride

150,334 views • 4 months ago

Increasingly, HTML Artifacts are becoming a core part of how I work with AI agents. Long-horizon agent sessions need a better way to surface insights about the work the agent has done. This may not be obvious right now, but as you start to let your agent work on dynamic workflows, large codebases, long-running loops (e.g., using /goal), and deep research tasks, you need a good way to present results. The chat window is not it. You also don't want to just trust everything the agents do. Artifacts provide an important verification layer, which in turn enables important decision-making. I like HTML artifacts because I can just ask the agent to produce as many of them (and in whatever form) as I need to verify the work and make sense of everything. I even built a nice tab system for my artifacts. They are great for continual learning and research. I use HTML artifacts for logging, tracking experiments, brainstorming, managing my inbox, code reviews, agent session management, deep research, writing, reading, and so much more. I believe Andrej Karpathy wrote about this somewhere: As we move on to more advanced applications of AI agents and outputs get more complex, we will start to find the need for even more advanced forms of interaction with AI, including interactive neural videos/simulations.

elvis

36,827 views • 1 month ago

Back when we were developing GEN3C, we often imagined a Holodeck-like future: a simulator where multiple agents can enter the same generated world, act independently, and learn to collaborate. Gamma-World makes this feel more concrete. It is a generative multi-agent world model that takes synchronized observations and actions, then rolls out what each agent will see next in the same evolving world, action-responsive at 24 FPS.

For me, the key challenge is going beyond two players. As more agents enter, identity cannot be tied to fixed slots, interaction cannot rely on dense pairwise attention, and independent actions still need to resolve into one shared state. Two ideas make this work:

1. Simplex RoPE: distinct agent identities without slot bias (unique, but permutation-equivalent).
2. Sparse Hub Attention: agents communicate through learnable hubs instead of dense all-to-all attention (agent → hub → agent). This keeps cross-agent communication scalable.

The exciting part: training on two-player data can generalize to four-player rollouts without additional training, and the same formulation extends to real-world bimanual robot coordination. A step toward populated world models: many agents, one shared world. Congrats to the team on Gamma-World! Project:

Xuanchi Ren

304,145 views • 1 month ago
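
The "agent → hub → agent" routing described in the Gamma-World post above can be sketched as two small attention steps. This is a simplified single-head illustration and not the Gamma-World implementation: there are no learned projections, the hub vectors are random rather than learned, Simplex RoPE is omitted, and the function names are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(queries, keys, values):
    """Single-head scaled dot-product attention."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values


def hub_attention(agents, hubs):
    """Route cross-agent communication through a small set of hubs."""
    hub_state = attend(hubs, agents, agents)     # agent -> hub: H x N scores
    return attend(agents, hub_state, hub_state)  # hub -> agent: N x H scores


rng = np.random.default_rng(0)
dim, n_hubs = 8, 2
hubs = rng.normal(size=(n_hubs, dim))  # learnable in a real model; random here

two_agents = rng.normal(size=(2, dim))
four_agents = rng.normal(size=(4, dim))

# The same hubs serve any number of agents without changing any shapes.
out2 = hub_attention(two_agents, hubs)
out4 = hub_attention(four_agents, hubs)

# Reordering the agents only reorders the outputs (no fixed slots).
perm = [2, 0, 3, 1]
out4_permuted = hub_attention(four_agents[perm], hubs)
```

With N agents and H hubs this computes 2·N·H attention scores instead of the N² needed for dense all-to-all attention, so for a fixed number of hubs the cost grows linearly with the number of agents.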

Introducing the BIOS API: Turn Your Agent Into a Research Scientist

Built to:
🦞 Add biomedical workflows to your OpenClaw🦞 agent
🧠 Create research or health agents w/ on-demand scientific intelligence
🧪 Pay per query via x402 on Base

Any agent or app can now tap into the BIOS AI Scientist, plugging BIOS into the broader agent economy.

What is BIOS?
BIOS is an AI Scientist designed to handle complex biomedical research by orchestrating specialized scientific subagents. Ranked #1 on the leading bioinformatics benchmark, BIOS is already being used by 1,000+ researchers and labs to build new drugs and medicines.

An Agentic Economy for Science
AI agents have proven they can form multi-billion dollar ecosystems. BIOS applies the same primitives to drug discovery pipelines and health. Instead of coding bots and personal AI assistants, think research agent swarms running on a modern scientific stack. Imagine an OpenClaw agent built for longevity: It scans new literature daily, generates novel compound hypotheses through BIOS, designs validation workflows, and routes the best candidates to wet-lab funding - all programmatically. Connect it with an agent for microbiome health, enabling agent “backrooms” that autonomously surface cross-disciplinary insights.

Micropayments for Scientific Work via x402
Each query triggers payment routing to BIOS and whichever subagents contribute to a response. The best agents earn. Usage settles instantly across contributing sources. The goal is pay-per-task science: paying for a CRISPR assay result, licensing a genomic dataset, or triggering a clinical data query - all settled in seconds via USDC. No purchase orders. No grant bureaucracy. No middlemen. x402 is the payment rail that makes agent-to-lab commerce possible - letting capital and cognition route themselves to the highest-signal science.

What Will You Build?
Drug discovery copilots? Longevity scouts? Automated literature monitors? Scientific due diligence agents? We’ll soon share the first implementations of the BIOS API. Stay tuned and see below for instructions on generating an API key for your agent or use-case.

Bio Protocol

25,865 views • 5 months ago

HERMES AGENT NOW SUPPORTS COMPUTER USE ON WINDOWS AND LINUX. CLICKS, TYPES, SCROLLS YOUR DESKTOP IN THE BACKGROUND WHILE YOU WORK.

computer use was macOS only. now it works on Windows and Linux too via Cua. Nous Research

HOW IT WORKS:
cua-driver runs as an MCP server. Hermes takes a screenshot with numbered elements. clicks element #14 (the search field). types a query. submits. reads the result.
during all of this:
→ your cursor stays where you left it
→ keyboard focus doesn't change
→ windows don't come to front
→ macOS doesn't switch Spaces
you and the agent co-work on the same machine.

WHAT IT CAN DO:
→ find your latest Stripe email and summarize it
→ fill forms in a web app that has no API
→ navigate desktop apps (Mail, browser, Finder)
→ interact with any GUI application
→ extract data from apps only accessible via screen

WORKS WITH ANY VISION MODEL:
not locked to Anthropic.

| Provider | Works |
|---|---|
| Claude (Sonnet/Opus) | best overall |
| GPT-4+, GPT-5.5 | full support |
| Gemini (via OpenRouter) | full support |
| Local vLLM / LM Studio | if model supports vision |
| Text-only models | degraded (accessibility tree only) |

SETUP:
hermes computer-use install
or: hermes tools → Computer Use → cua-driver
grant permissions when prompted:
→ Accessibility (system settings)
→ Screen Recording (system settings)
start a session: hermes -t computer_use chat
or add to config.yaml / Desktop app settings to enable permanently.

SAFETY:
→ destructive actions require your approval
→ blocked key combos: empty trash, force delete, lock screen, log out
→ blocked type patterns: curl | bash, sudo rm -rf /, fork bombs
→ agent cannot click permission dialogs
→ agent cannot type passwords
→ agent cannot follow instructions embedded in screenshots
pair with approvals.mode: manual if you want every single click confirmed.

TOKEN NOTE:
screenshots are expensive. each one adds vision tokens to context. use computer_use for tasks where no API exists. if the tool has an API or MCP server, use that instead.

15 levels of Hermes Agent👇

YanXbt

29,127 views • 1 month ago
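
The "blocked type patterns" item in the safety list of the Hermes post above suggests a denylist check on text before the agent types it. The sketch below shows how such a check could look. It is purely illustrative and not Hermes Agent's or cua-driver's code: the patterns, the `BLOCKED_PATTERNS` list, and the `is_blocked` helper are all hypothetical, and a real filter would need far broader coverage.

```python
import re

# Hypothetical denylist mirroring the three examples named in the post.
BLOCKED_PATTERNS = [
    # Piping a download straight into a shell, e.g. "curl <url> | bash".
    re.compile(r"\bcurl\b[^|\n]*\|\s*(?:sudo\s+)?(?:ba)?sh\b"),
    # Recursive forced delete of the filesystem root.
    re.compile(r"\bsudo\s+rm\s+-(?:rf|fr)\s+/(?:\s|$)"),
    # The classic bash fork bomb, tolerating optional whitespace.
    re.compile(r":\(\)\s*\{\s*:\s*\|\s*:\s*&\s*\}\s*;\s*:"),
]


def is_blocked(text: str) -> bool:
    """Return True if the text to be typed matches any denylisted pattern."""
    return any(pattern.search(text) for pattern in BLOCKED_PATTERNS)


for candidate in [
    "curl https://example.com/install.sh | bash",
    "sudo rm -rf /",
    "sudo rm -rf /tmp/build",
    "hello world",
]:
    print(f"{candidate!r}: {'blocked' if is_blocked(candidate) else 'allowed'}")
```

A pattern denylist like this is easy to bypass on its own, which is presumably why the post pairs it with approval prompts for destructive actions.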

Sentient has just integrated Messari's data and research into its AI-powered search platform, Sentient Chat. This partnership lets users access Messari’s research directly through the Agent Hub in Sentient Chat, where they can get instant answers and insights from Messari reports. The integration is done via Messari Copilot, which means users can now easily get to Messari’s crypto data without having to dig through extensive reports themselves. Messari's data and research now feed into Sentient’s Agentic Perplexity, which users can reach in the Agent Hub for all their crypto-related questions. Integrating Messari’s research helps provide an open and community-driven platform for AI-powered search, ensuring that users have access to the best crypto data and insights in real time, while also expanding the functionality of Sentient’s Agent Hub, where users can find and use a growing library of AI agents for various tasks. Now I don't know about you, but I know where I'll be getting my stats from moving forward.

Polygon Stats

32,701 views • 1 year ago

With ChatGPT Atlas, we aim to push the boundary of what a browser can be by integrating ChatGPT and evolving it into an agent that takes action for you. But even as we build that future, the basics still matter. Most people we talk to have a tab strip overflowing with too many tabs (mine certainly is!). This week's Atlas release: revamped tab search and a new 'auto organize' button for your tabs. Click to remove duplicates, merge windows, or let ChatGPT group your tabs in a way that makes sense. Just hit “update” in the top right to try.

Adam Fry

153,397 views • 5 months ago

Excited to launch a new way to upskill with AI agents. This is how we are making it possible for anyone to learn to build with coding agents. To start, we are launching 4 new hands-on labs on the following topics:
- Agent Skills
- Agentic Image Generation
- 30 Days of Hermes Agents
- Prompt Engineering with Agents

I am confident that with our new DAIR.AI platform, anyone can become a top AI builder by building with and acquiring in-demand AI skills. And there is a lot more landing in the coming weeks.

elvis

19,058 views • 1 month ago