Toolathlon
L3
OOD — Specialized Tools
TaskPoolSetEnv
A zero-shot generalist benchmark with 604 tools across 32 MCP (Model Context Protocol) applications, run in stateful container environments. Each task comes with its own toolset, which tests the agent's ability to adapt to completely unseen tool ecosystems.
Tool Pool: 604 Tools (32 MCP Apps)
Toolset: Per-task
Protocol: OpenAI Function Calling
Environment: Stateful (Container)
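To make the per-task toolset idea concrete, here is a minimal sketch of how a subset of a shared tool pool might be exposed to an agent in the OpenAI function-calling tool format. It is an illustration only, not the benchmark's actual code: the pool contents, the tool names (`calendar_create_event`, `files_read_text`, `mail_send`) and the helpers `to_function_tool` and `build_task_toolset` are all hypothetical. Only the outer `{"type": "function", "function": {...}}` shape follows the function-calling format.

```python
import json

# Hypothetical miniature tool pool. The real pool has 604 tools across 32 MCP apps;
# these three entries and their schemas are invented for illustration.
TOOL_POOL = {
    "calendar_create_event": {
        "description": "Create a calendar event.",
        "parameters": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "start": {"type": "string", "description": "ISO 8601 start time"},
            },
            "required": ["title", "start"],
        },
    },
    "files_read_text": {
        "description": "Read a text file from the task workspace.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    "mail_send": {
        "description": "Send an email.",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string"},
                "body": {"type": "string"},
            },
            "required": ["to", "body"],
        },
    },
}


def to_function_tool(name, spec):
    """Wrap one pool entry in the function-calling tool envelope."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": spec["description"],
            "parameters": spec["parameters"],
        },
    }


def build_task_toolset(pool, tool_names):
    """Select the tools one task exposes; fail loudly on unknown names."""
    missing = [n for n in tool_names if n not in pool]
    if missing:
        raise KeyError(f"tools not in pool: {missing}")
    return [to_function_tool(n, pool[n]) for n in tool_names]


# Two hypothetical tasks, each seeing a different slice of the same pool.
task_a_tools = build_task_toolset(TOOL_POOL, ["calendar_create_event", "files_read_text"])
task_b_tools = build_task_toolset(TOOL_POOL, ["mail_send"])

# The resulting list is what would be passed as the tool definitions for a request.
task_a_json = json.dumps(task_a_tools, indent=2)
```

Because each task receives only its own list, the agent cannot rely on a fixed tool vocabulary and has to read the names, descriptions, and parameter schemas it is given for that task.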
Performance (Success Rate %): chart comparing Base, Best 8B Baseline, DIVE (SFT), and DIVE (RL).