EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings
Abstract
Large language models are shifting from passive information providers to active agents intended for complex workflows. However, their deployment as reliable AI workers in enterprise settings is stalled by benchmarks that fail to capture the intricacies of professional environments, namely the need for long-horizon planning amid persistent state changes and strict access protocols. In this work, we introduce EnterpriseOps-Gym, a benchmark designed to evaluate agentic planning in realistic enterprise settings. Specifically, EnterpriseOps-Gym features a containerized sandbox with 164 database tables and 512 functional tools that mimic real-world search friction. Within this environment, agents are evaluated on 1,150 expert-curated tasks across eight mission-critical verticals (including Customer Service, HR, and IT). Our evaluation of 14 frontier models reveals critical limitations in state-of-the-art systems: the top performer, Claude Opus 4.5, achieves only 37.4% success. Further analysis shows that providing oracle human plans improves performance by 14-35 percentage points, pinpointing strategic reasoning as the primary bottleneck. Additionally, agents frequently fail to refuse infeasible tasks (the best model achieves only 53.9%), leading to unintended and potentially harmful side effects. Our findings underscore that current agents are not yet ready for autonomous enterprise deployment. More broadly, EnterpriseOps-Gym provides a concrete testbed for advancing the robustness of agentic planning in professional workflows.
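To make "stateful" evaluation concrete, here is a minimal sketch of a Gym-style rollout loop against a persistent sandbox. Every name in it (`env.reset`, `agent.act`, `action.is_terminal`, `env.evaluate`) is a hypothetical illustration under assumed interfaces, not the benchmark's actual API.

```python
# Minimal sketch of a stateful tool-use rollout, assuming a Gym-style
# sandbox API. All names below are hypothetical illustrations, not
# EnterpriseOps-Gym's real interface.

def run_episode(env, agent, task, max_steps=50):
    """Roll out one task and score the resulting persistent state."""
    obs = env.reset(task)              # load task context + initial DB state
    for _ in range(max_steps):
        action = agent.act(obs)        # pick a tool call, an answer, or a refusal
        if action.is_terminal:         # agent answered, or refused an infeasible task
            break
        obs = env.step(action)         # tool call mutates the persistent DB state
    # Success is judged on the final database state and any side effects,
    # not just the final message; that is what makes the evaluation stateful.
    return env.evaluate(task)
```

The key design point this sketch captures is that a wrong tool call is not free: it permanently alters the sandbox state, so unintended side effects count against the agent even if its final answer looks plausible.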
Community
The paper introduces EnterpriseOps-Gym, a benchmark designed to evaluate LLM agents on stateful planning and tool use in realistic enterprise environments. The evaluation framework features a fully interactive sandbox containing 164 database tables, 512 tools, and 1,150 expert-curated tasks across eight enterprise domains. Evaluations across 14 frontier models show a performance ceiling of 37.4% task success, underscoring that long-horizon planning and policy adherence are the true bottlenecks preventing the deployment of reliable AI workers.
paper: https://arxiv.org/pdf/2603.13594
github: https://github.com/ServiceNow/EnterpriseOps-Gym
website: https://enterpriseops-gym.github.io
benchmark: https://huggingface.co/datasets/ServiceNow-AI/EnterpriseOps-Gym
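For readers who want to inspect the task set directly, a minimal sketch using the `datasets` library against the Hub repo linked above; the split name `"test"` is an assumption, so consult the dataset card for the actual configs and splits.

```python
# Hedged sketch: load the task set from the Hugging Face Hub.
# The split name "test" is an assumption; check the dataset card.
from datasets import load_dataset

tasks = load_dataset("ServiceNow-AI/EnterpriseOps-Gym", split="test")
print(len(tasks))   # expect on the order of 1,150 expert-curated tasks
print(tasks[0])     # inspect one task record's fields
```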
Would be very interesting to explore this further in human-in-the-loop workflows, i.e. workflows where the agent's autonomy is deliberately gated by human control at specific points.
Basically, a non-HITL success rate of 37.4% is already quite good - humans should handle the final step.
@panikov - agree 100%. Our focus will be on HITL workflows for V2, among other things. Appreciate your feedback!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context (2026)
- World of Workflows: a Benchmark for Bringing World Models to Enterprise Systems (2026)
- CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty (2026)
- Gaia2: Benchmarking LLM Agents on Dynamic and Asynchronous Environments (2026)
- EntWorld: A Holistic Environment and Benchmark for Verifiable Enterprise GUI Agents (2026)
- DevOps-Gym: Benchmarking AI Agents in Software DevOps Cycle (2026)
- LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces (2026)