The evaluation platform for computer use agents
Evaluate and iterate on your computer use agent across hundreds of environments and thousands of tasks.
We 💛 Researchers.





Evaluate anything.
Evaluate instantly, anytime.
Stop waiting hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
OSWorld Benchmark Runtime Comparison
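To make the concurrency claim concrete, here is a minimal sketch of the fan-out pattern in Python. It assumes nothing about HUD's real API: `launch_environment` and `run_task` are simulated placeholders invented for this example, and the timings are fake. The point is only that many environments can be active at once, so wall-clock time stops scaling with task count.

```python
# Minimal sketch only: launch_environment and run_task are simulated
# placeholders, not HUD's actual API.
import asyncio
import random
import time

async def launch_environment(env_id: int) -> str:
    """Placeholder: pretend to boot a full OS environment."""
    await asyncio.sleep(random.uniform(1.0, 3.0))   # simulated boot time
    return f"env-{env_id}"

async def run_task(env_name: str, task_id: str) -> bool:
    """Placeholder: pretend to run one evaluation task to completion."""
    await asyncio.sleep(random.uniform(2.0, 5.0))   # simulated agent rollout
    return random.random() < 0.5                    # simulated pass/fail

async def evaluate(task_ids: list[str], concurrency: int = 100) -> float:
    """Run every task, keeping up to `concurrency` environments active at once."""
    sem = asyncio.Semaphore(concurrency)

    async def run_one(i: int, task_id: str) -> bool:
        async with sem:
            env_name = await launch_environment(i)
            return await run_task(env_name, task_id)

    results = await asyncio.gather(
        *(run_one(i, t) for i, t in enumerate(task_ids))
    )
    return sum(results) / len(results)

if __name__ == "__main__":
    start = time.time()
    score = asyncio.run(evaluate([f"task-{i}" for i in range(300)]))
    print(f"success rate: {score:.1%} in {time.time() - start:.0f}s")
```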
Integrate Your Agent's Stack
Evaluate agentic abilities while leveraging existing tools & models.
Agents, your way.
Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (such as VLMs or RAG systems), or APIs. Evaluate core agentic abilities across the full range of environments while keeping the unique components of your agent stack in place.
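As a rough illustration of what "bring your own agent" can look like, the sketch below wires an arbitrary agent into a generic evaluation loop. The `Agent` protocol, `run_episode` harness, and `DummyEnv` are names invented for this example, not the actual HUD evaluation schema; the model calls and tools inside the agent stay entirely yours.

```python
# Illustrative sketch: Agent, run_episode, and DummyEnv are made-up names,
# not HUD's actual evaluation schema.
from typing import Protocol

class Agent(Protocol):
    def act(self, observation: dict) -> dict:
        """Map an observation (screenshot, accessibility tree, text) to an action."""
        ...

class MyVLMAgent:
    """Wrap whatever stack you already run: VLM calls, RAG lookups, custom tools."""
    def act(self, observation: dict) -> dict:
        # In a real agent this would call your own model on observation["screenshot"].
        return {"type": "click", "x": 100, "y": 200}

def run_episode(agent: Agent, env, max_steps: int = 50) -> bool:
    """Generic loop: the harness only needs `act`, nothing else about the agent."""
    obs = env.reset()
    for _ in range(max_steps):
        obs, done, success = env.step(agent.act(obs))
        if done:
            return success
    return False

class DummyEnv:
    """Stand-in environment so the loop above runs end to end."""
    def __init__(self) -> None:
        self.steps = 0
    def reset(self) -> dict:
        return {"screenshot": b"", "text": "open the settings page"}
    def step(self, action: dict):
        self.steps += 1
        done = self.steps >= 3
        return {"screenshot": b"", "text": "step taken"}, done, done

if __name__ == "__main__":
    print("success:", run_episode(MyVLMAgent(), DummyEnv()))
```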
Any evaluation, any environment.
Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile-based environments.
Test on any environment
Build evaluations for niche workflows, proprietary tools, and unique agent loops.
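A custom evaluation can be thought of as a small declarative spec: an instruction for the agent, an environment to run it in, optional setup, and a check on the final state. The `Task` dataclass and its field names below are assumptions made for illustration, not HUD's actual task schema, and the accounting app and setup command are invented.

```python
# Illustrative sketch of a custom task spec. The Task dataclass, its fields,
# and the example workflow are assumptions, not HUD's actual schema.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    prompt: str                     # instruction given to the agent
    environment: str                # e.g. "browser", "ubuntu-desktop", or a Dockerfile reference
    setup: list[str] = field(default_factory=list)        # commands to prepare the environment
    check: Callable[[dict], bool] = lambda state: False   # grades the final environment state

# A niche, workflow-specific task on a proprietary tool (names are made up).
invoice_task = Task(
    prompt="Export the March invoices from the accounting app as a CSV file.",
    environment="dockerfile://internal-accounting:latest",
    setup=["seed-demo-data --month march"],
    check=lambda state: state.get("csv_exported", False),
)

if __name__ == "__main__":
    print(invoice_task.prompt, "->", invoice_task.environment)
```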
Pricing
Basic
Most informative evalsets cost ~$10-15 per run (avg. 10 min).
*Plus $0.15/hr per active environment.
- ✓Access to all stock evaluations
- ✓Full control, telemetry and evaluation
- ✓Access to public leaderboards (Coming soon)
Start with $10 in free credits!
Get started
Enterprise
Significant discounts available for labs running evals at scale.
- ✓Benchmark agents on proprietary datasets & workflows
- ✓Stress-test new models before production deployment
- ✓Dedicated support for complex evaluation needs
Are you a researcher?
Get $100 in free credits when you sign up with a .edu email address.
Need more details? Get a pricing breakdown in your inbox.
Or maybe you have specific needs? Tell us what you're building.