Research

HUD Autonomy: What is Agentic AGI?

Lorenss Martinsons, The HUD Team

At HUD, our mission is to help align human and AI agents' behavior. Today, we're excited to introduce Autonomy, our comprehensive evaluation framework for AI agents.

Our Mission

At HUD, we believe high-quality data and evaluations will be critical to the development of agentic systems, ultimately leading to AGI. We designed the Autonomy evaluation as a first step toward testing AI agents' ability to navigate a diverse set of computer and browser tasks. As a basis for our work, we took principles from ARC-AGI[1][2] and existing benchmarks[3][4][5][6][7][8] to create the most comprehensive evaluation framework for agentic AI models yet. Importantly, our benchmarks support computer, browser, and tool-use agents, and therefore test generalizable abilities relevant to everyday intelligence tasks.

Designing Autonomy-10

Autonomy-10 is only the first 10% of what we're building toward the broadest-scale AI agent benchmark. Still, Autonomy-10 is already the most functionally diverse evaluation of current state-of-the-art models. We take inspiration from current agentic AI benchmarks[3], as well as from more general AI-progress and safety evaluations such as the Humanity's Last Exam program[10].

In total, we designed 100+ tasks for Autonomy-10, around half of which we sourced from previous projects and adapted to more rigorous guidelines. We took inspiration from OSWorld[2], WebVoyager[3], WebArena[4], GAIA[5], Mind2Web[6], SheetCopilot[7], and many other benchmarks. Importantly, we are choosing to keep our own designed tasks private for now, as we believe they are the most important part of the Autonomy evaluation framework in its early stages.

While designing new tasks for Autonomy, we aimed to maximize diversity in both information content and difficulty. We wrote around 5x more tasks than we could evaluate and selected the most additive tasks for measuring the reasoning distribution. We settled on a set of 9 domains that we believe are representative of the broad range of tasks an agent would encounter – Documents, Browsing, Multimedia, Processes, Products, Applications, Collaboration, Knowledge, and Games. By creating custom tasks for each of these categories, we were able to design a diverse benchmark with a >98% completion rate for expert humans and a <25% completion rate for the best AI agents.
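As a rough illustration of this selection process (the code below is our own hypothetical sketch, not HUD's actual pipeline; the task records and the greedy sampling strategy are assumptions), a large task pool could be filtered down to a domain-balanced benchmark like so:

```python
from collections import defaultdict

# The nine Autonomy-10 domains named in the post.
DOMAINS = [
    "Documents", "Browsing", "Multimedia", "Processes", "Products",
    "Applications", "Collaboration", "Knowledge", "Games",
]

def select_balanced(tasks, per_domain):
    """Keep up to `per_domain` tasks per domain, spread across
    the difficulty range within each domain."""
    by_domain = defaultdict(list)
    for task in tasks:
        by_domain[task["domain"]].append(task)
    selected = []
    for domain in DOMAINS:
        pool = sorted(by_domain[domain], key=lambda t: t["difficulty"])
        # Take an even spread over the sorted difficulty range.
        step = max(1, len(pool) // per_domain)
        selected.extend(pool[::step][:per_domain])
    return selected

# Toy pool: 5x more candidate tasks than we keep, mirroring the post's ratio.
pool = [{"domain": d, "difficulty": i} for d in DOMAINS for i in range(5)]
benchmark = select_balanced(pool, per_domain=1)
```

The point of the sketch is only that balance is enforced per domain rather than globally, so no single domain can dominate the benchmark.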

Analyzing Operator

We evaluated Operator through the chat interface on our 100+ task set right after its release. From our preliminary evaluation on the Autonomy-10 benchmark, a few of its major bottlenecks lie in its refusal to open certain websites (YouTube, the New York Times, etc.) that may be relevant to various tasks, and in its limited data-processing scope beyond Google's platforms. We saw that few-shot prompting occasionally allowed it to escape these guidelines, which, while it did lead to better results, also exposes some safety risks. Our safety-related evaluations agreed with these findings.

Practically, Operator seems to perform better than other agents thanks to both stronger visual and task-relevant reasoning capabilities, likely distilled into the model via RL[9]. We see especially significant increases in the Intelligence metric, which reflects improvement in the model's fundamental ability to adapt to new environments and reflect on its behavior. Our team at HUD is excited to see the capabilities of future agents that Operator's release inspires, and we will continue to work on the highest-quality data and evaluations for developing successful AI agents.

Evaluating Intelligence

At the center of our research is establishing the fundamental differences between the kinds of representations necessary for successful agent behavior[2]. We define three distinct variants of capability:

  • Skill requires consistent handling of typical scenarios that have been distilled into the model. Here, the agent should show robustness to variations in everyday tasks. (implicit priors > explicit priors)
  • Flexibility requires orchestrating tasks in sub-domains that may be outside of the agent's prior knowledge. The agent should adapt to tasks that are not local to one representational domain. (explicit priors > implicit priors)
  • Intelligence requires adaptive learning of world models that include new skills and flexible representations. This metric is designed to measure the agent's skill-acquisition efficiency with respect to well-defined and minimal priors. (explicit priors = implicit priors → 0)
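To make the three axes concrete, results might be aggregated into a per-capability profile along these lines (a hypothetical sketch of ours; the task tags and pass/fail scoring are illustrative assumptions, not HUD's published methodology):

```python
from collections import defaultdict

# The three capability variants defined above.
CAPABILITIES = ("Skill", "Flexibility", "Intelligence")

def capability_profile(results):
    """Aggregate (capability, success) pairs into a success rate
    for each capability variant that has at least one task."""
    passed = defaultdict(int)
    total = defaultdict(int)
    for capability, success in results:
        total[capability] += 1
        passed[capability] += int(success)
    return {c: passed[c] / total[c] for c in CAPABILITIES if total[c]}

# Toy run: each tuple is (capability tag, did the agent succeed?).
results = [
    ("Skill", True), ("Skill", True), ("Skill", False),
    ("Flexibility", True), ("Flexibility", False),
    ("Intelligence", False), ("Intelligence", False),
]
profile = capability_profile(results)
```

Reporting the three rates separately, rather than one blended score, is what lets a benchmark show Skill saturating while Intelligence stays low.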

Current agent benchmarks mostly evaluate simple Skill tasks, which we believe will saturate by mid-2025 due to exposure to verticalized priors in model fine-tuning. Our goal with Autonomy is to include more specialized evaluations and robustness evaluations that showcase an agent's baseline capability in diverse settings. Flexibility benchmarks will be slower to saturate, as we see the potential to develop benchmarks requiring highly specialized knowledge and the integration of large knowledge bases (for example, some of the recent work in the Humanity's Last Exam dataset[10]). Still, our main long-term focus is on developing a comprehensive Intelligence metric, an analogue to ARC-AGI for agents, that evaluates the general intelligence of agents in grounded settings. Autonomy-10 still features all three categories, but, going forward, we may only release public versions of our more advanced Intelligence metrics.

Computers, Browsers, Tools

We think of an agent's interaction with the world as that of any agent interfacing with some operating system. As such, we are careful to constrain our data to a specialized environment, and we define actionable tasks that are theoretically doable by computer, browser, and tool-use agents alike. We believe that successful AI agents will utilize different interfaces intelligently to achieve the best outcome, so we must anticipate and encourage varied solutions to the same problem.

At the same time, we must acknowledge that there are fundamental differences in capabilities across operating systems. For example, extracting information from the internet via API use alone is complex, especially from long, multimodal, or badly documented websites. Similarly, it is impossible to do industry-level work on platforms such as DaVinci Resolve, Ableton Live, Blender, Unity, etc. with a browser-use agent. Nevertheless, while we think a truly generalist AI will need to use a computer in the fullest sense, we want to evaluate fundamental features of models' capabilities that are tied to their substrate as little as possible.

To this end, Autonomy stratifies tasks into Flag, Artifact, and Open categories. Flag tasks require the agent to retrieve a well-defined piece of information or achieve a particular state. Artifact tasks require the agent to produce, compile, or edit a linguistic document or a set of documents given a set of criteria. Open tasks require the production of content such as an image or a website, or any other knowledge state with many degrees of freedom. These categories sample the diversity of how people reason about using these platforms and creating valuable work, and we think that by moving beyond specialized agent-friendly APIs we can truly address these capabilities.
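One way to picture this stratification is that each task category implies a different verification strategy. The schema below is a hypothetical sketch of ours (the `Task` fields, checklist scoring, and function names are assumptions, not HUD's implementation):

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    prompt: str
    kind: str                                      # "flag", "artifact", or "open"
    expected_flag: str = ""                        # exact answer/state, for flag tasks
    criteria: list = field(default_factory=list)   # checklist, for artifact tasks

def verify(task, output):
    """Score an agent's output depending on the task category."""
    if task.kind == "flag":
        # Flag: a well-defined piece of information — exact match suffices.
        return float(output.strip() == task.expected_flag)
    if task.kind == "artifact":
        # Artifact: a document scored against a checklist of criteria.
        met = sum(criterion in output for criterion in task.criteria)
        return met / len(task.criteria)
    # Open: many degrees of freedom — needs a rubric or judge, not string checks.
    raise NotImplementedError("open tasks require a grading rubric")

flag_task = Task(prompt="Retrieve the flag", kind="flag", expected_flag="42")
artifact_task = Task(prompt="Draft a report", kind="artifact",
                     criteria=["title", "summary"])
```

The deliberately unimplemented `open` branch reflects the post's point: Open tasks resist mechanical verification, which is part of what makes them the hardest stratum to evaluate.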

We plan to open-source some of these evaluation metrics soon and to work closely with labs and companies to gather feedback on the future of agentic AI evaluations. We think there is much more still to understand about agentic circuits, their safety, and their interaction with operating systems, which is why it is important to work on this problem now.

Further Research

We are going to expand our evaluation framework and release more data in the coming weeks. We are also evaluating the ByteDance UI-TARS model and other models from the open-source community. Join our mailing list to stay updated on our research and releases.

Work with Us

We focus on solving the data problem for computer-use agents, and we are currently working with frontier research labs to curate the highest-quality datasets and evaluations. Reach out to us to learn more about how we can help your AI agent development.
