The evaluation platform for computer use agents
Evaluate and iterate on your computer use agent across hundreds of environments and thousands of tasks.
We 💛 Researchers.





Evaluate anything.
Evaluate instantly, anytime.
Stop waiting hours for results. Our platform orchestrates hundreds of concurrent machines, spinning up full OS environments in seconds for rapid evaluation cycles. Iterate faster, identify regressions sooner, and push better agents to production.
OSWorld Benchmark Runtime Comparison
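To make the concurrency claim concrete, here is a minimal sketch of the fan-out pattern in Python. It assumes nothing about HUD's real API: `launch_environment` and `run_task` are simulated placeholders invented for this example, and the timings are fake. The point is only that many environments can be active at once, so wall-clock time stops scaling with task count.

```python
# Minimal sketch only: launch_environment and run_task are simulated
# placeholders, not HUD's actual API.
import asyncio
import random
import time

async def launch_environment(env_id: int) -> str:
    """Placeholder: pretend to boot a full OS environment."""
    await asyncio.sleep(random.uniform(1.0, 3.0))   # simulated boot time
    return f"env-{env_id}"

async def run_task(env_name: str, task_id: str) -> bool:
    """Placeholder: pretend to run one evaluation task to completion."""
    await asyncio.sleep(random.uniform(2.0, 5.0))   # simulated agent rollout
    return random.random() < 0.5                    # simulated pass/fail

async def evaluate(task_ids: list[str], concurrency: int = 100) -> float:
    """Run every task, keeping up to `concurrency` environments active at once."""
    sem = asyncio.Semaphore(concurrency)

    async def run_one(i: int, task_id: str) -> bool:
        async with sem:
            env_name = await launch_environment(i)
            return await run_task(env_name, task_id)

    results = await asyncio.gather(
        *(run_one(i, t) for i, t in enumerate(task_ids))
    )
    return sum(results) / len(results)

if __name__ == "__main__":
    start = time.time()
    score = asyncio.run(evaluate([f"task-{i}" for i in range(300)]))
    print(f"success rate: {score:.1%} in {time.time() - start:.0f}s")
```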
Integrate Your Agent's Stack
Evaluate agentic abilities while leveraging existing tools & models.
Agents, your way.
Don't force your agent into a specific mold. The HUD evaluation schema adapts to any architecture: bring your own tools, models (such as VLMs or RAG systems), or APIs. Evaluate core agentic abilities across the full range of environments while keeping the unique components of your agent stack in place.
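As a rough illustration of what "bring your own agent" can look like, the sketch below wires an arbitrary agent into a generic evaluation loop. The `Agent` protocol, `run_episode` harness, and `DummyEnv` are names invented for this example, not the actual HUD evaluation schema; the model calls and tools inside the agent stay entirely yours.

```python
# Illustrative sketch: Agent, run_episode, and DummyEnv are made-up names,
# not HUD's actual evaluation schema.
from typing import Protocol

class Agent(Protocol):
    def act(self, observation: dict) -> dict:
        """Map an observation (screenshot, accessibility tree, text) to an action."""
        ...

class MyVLMAgent:
    """Wrap whatever stack you already run: VLM calls, RAG lookups, custom tools."""
    def act(self, observation: dict) -> dict:
        # In a real agent this would call your own model on observation["screenshot"].
        return {"type": "click", "x": 100, "y": 200}

def run_episode(agent: Agent, env, max_steps: int = 50) -> bool:
    """Generic loop: the harness only needs `act`, nothing else about the agent."""
    obs = env.reset()
    for _ in range(max_steps):
        obs, done, success = env.step(agent.act(obs))
        if done:
            return success
    return False

class DummyEnv:
    """Stand-in environment so the loop above runs end to end."""
    def __init__(self) -> None:
        self.steps = 0
    def reset(self) -> dict:
        return {"screenshot": b"", "text": "open the settings page"}
    def step(self, action: dict):
        self.steps += 1
        done = self.steps >= 3
        return {"screenshot": b"", "text": "step taken"}, done, done

if __name__ == "__main__":
    print("success:", run_episode(MyVLMAgent(), DummyEnv()))
```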
Any evaluation, any environment.
Go beyond standard benchmark sets. Create tasks tailored to your specific agent, product, or workflow across diverse environments. Evaluate performance on desktop software, web browsers, text-based interfaces, or proprietary Dockerfile-based environments.
Test on any environment
Build evaluations for niche workflows, proprietary tools, and unique agent loops.
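A custom evaluation can be thought of as a small declarative spec: an instruction for the agent, an environment to run it in, optional setup, and a check on the final state. The `Task` dataclass and its field names below are assumptions made for illustration, not HUD's actual task schema, and the accounting app and setup command are invented.

```python
# Illustrative sketch of a custom task spec. The Task dataclass, its fields,
# and the example workflow are assumptions, not HUD's actual schema.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    prompt: str                     # instruction given to the agent
    environment: str                # e.g. "browser", "ubuntu-desktop", or a Dockerfile reference
    setup: list[str] = field(default_factory=list)        # commands to prepare the environment
    check: Callable[[dict], bool] = lambda state: False   # grades the final environment state

# A niche, workflow-specific task on a proprietary tool (names are made up).
invoice_task = Task(
    prompt="Export the March invoices from the accounting app as a CSV file.",
    environment="dockerfile://internal-accounting:latest",
    setup=["seed-demo-data --month march"],
    check=lambda state: state.get("csv_exported", False),
)

if __name__ == "__main__":
    print(invoice_task.prompt, "->", invoice_task.environment)
```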
Pricing
Basic
Most informative evalsets cost ~$10-15 per run (avg. 10 min).
*Plus $0.15/hr per active environment.
- ✓Access to all stock evaluations
- ✓Full control, telemetry and evaluation
- ✓Access to public leaderboards (Coming soon)
Start with $10 in free credits!
Get started
Enterprise
Significant discounts available for labs running evals at scale.
- ✓Benchmark agents on proprietary datasets & workflows
- ✓Stress-test new models before production deployment
- ✓Dedicated support for complex evaluation needs
Are you a researcher?
Get $100 in free credits when you sign up with a .edu email address.
Need more details? Get a pricing breakdown in your inbox.
Or maybe you have specific needs? Tell us what you're building.