The evaluation platform for computer use agents
Evaluate and improve your computer use agent in hundreds of environments and thousands of tasks.
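# load (assumes `client` and `agent` have already been created)
gym = client.load("OSWorld")
env = gym.make()
observation = env.reset()
# run: feed each observation to the agent and apply its action
for _ in range(100):
    action = agent.predict(observation)
    observation = env.step(action)
# evaluate
env.evaluate()
env.close()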
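The snippet above relies on a `client` and an `agent` that the page does not define. As a rough, self-contained sketch of the same load / run / evaluate shape, the example below uses hypothetical stand-ins (`StubClient`, `StubGym`, `StubEnv`, `ClickAgent`) that only mimic the method names shown above. They are assumptions for illustration, not the platform's real classes, and the success rule inside `evaluate()` is invented.

```python
class StubEnv:
    """Hypothetical stand-in for an environment; not the platform's class."""

    def __init__(self, goal=5):
        self.goal = goal        # invented success threshold
        self.clicks = 0
        self.closed = False

    def reset(self):
        self.clicks = 0
        return {"clicks": self.clicks}

    def step(self, action):
        # Count "click" actions; ignore anything else.
        if action == "click":
            self.clicks += 1
        return {"clicks": self.clicks}

    def evaluate(self):
        # Invented rule: the task succeeds once enough clicks happened.
        return self.clicks >= self.goal

    def close(self):
        self.closed = True


class StubGym:
    def make(self):
        return StubEnv()


class StubClient:
    def load(self, name):
        # The gym name is accepted but unused in this stub.
        return StubGym()


class ClickAgent:
    """Trivial agent that always clicks, whatever it observes."""

    def predict(self, observation):
        return "click"


def run_episode(client, agent, gym_name="OSWorld", max_steps=100):
    """Load a gym, run the agent for max_steps, evaluate, always close."""
    gym = client.load(gym_name)
    env = gym.make()
    try:
        observation = env.reset()
        for _ in range(max_steps):
            action = agent.predict(observation)
            observation = env.step(action)
        return env.evaluate()
    finally:
        env.close()


result = run_episode(StubClient(), ClickAgent())
```

Wrapping the loop in `try`/`finally` is a design choice in this sketch so that `close()` runs even if the agent raises partway through an episode.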
We 💛 Researchers.

Evaluate anything.
Features
Available, always
We orchestrate hundreds of concurrent machines to spin up an environment and evaluate within seconds.
20s
[ average time taken per task ]
We adapt to your agent
Equip your agent with any other tools or services you need; we evaluate the computer use part.
52k
[ actions performed in our gyms ]
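One way to picture "we adapt to your agent" is an agent that consults its own tools internally and only hands a computer action back to the environment. The sketch below is an assumption-heavy illustration: the `ToolAugmentedAgent` class, the `search` tool, and the dictionary-shaped observation and action are all hypothetical. Only the `predict` method name comes from the snippet on this page.

```python
from typing import Callable, Dict, List


class ToolAugmentedAgent:
    """Hypothetical agent that calls its own tools before acting."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]):
        self.tools = tools
        self.tool_calls: List[str] = []  # record of which tools were used

    def predict(self, observation: dict) -> dict:
        # Assumed observation shape: a dict with an "instruction" string.
        query = observation.get("instruction", "")
        notes = {}
        for name, tool in self.tools.items():
            notes[name] = tool(query)
            self.tool_calls.append(name)
        # Assumed action shape: only this dict would reach the environment.
        text = notes.get("search", query)
        return {"type": "type_text", "text": text}


# Usage with a toy "search" tool that just upper-cases the query.
agent = ToolAugmentedAgent({"search": lambda q: q.upper()})
action = agent.predict({"instruction": "open settings"})
```

The tool calls stay inside the agent in this sketch, so the environment sees a single computer action per step regardless of how many tools were consulted.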
Rich evaluations
Use our custom evaluation pipelines, with detailed telemetry and automatic judges.
10k
[ tasks analyzed and evaluated ]
Case studies
Pricing
Basic
$2/evaluation
- ✓ Access to all stock evaluations
- ✓ Full control, telemetry and evaluation
- ✓ Access to public leaderboards
Enterprise
Custom
- ✓ Bespoke evaluation set creation
- ✓ Priority access to all gyms
- ✓ Dedicated support team