Evidence-Based Resolution
Cases expose alerts, logs, workflow playbooks, and timeline notes gradually. The agent must collect enough evidence before proposing cause and mitigation.
RunbookOps / CaseOps Benchmark
RunbookOps turns real operational work into a deterministic benchmark. Agents must gather evidence, classify impact, route ownership, diagnose the issue, choose a safe resolution, and close customer-facing cases without shallow shortcuts.
Cases expose alerts, logs, workflow playbooks, and timeline notes gradually. The agent must collect enough evidence before proposing cause and mitigation.
Typed models, FastAPI endpoints, Docker deployment, deterministic grading, and a baseline inference runner aligned with the OpenEnv submission contract.
RunbookOps is framed as operational case handling across access failures, order issues, payment exceptions, message delivery, search freshness, and integration regressions.