Toolathlon

L3 OOD: Specialized Tools (Task, Pool, Set, Env)

Zero-shot generalist benchmark with 604 tools across 32 MCP (Model Context Protocol) applications in stateful container environments. Each task has a unique per-task toolset, testing the agent's ability to adapt to completely unseen tool ecosystems.

Tool Pool: 604 tools (32 MCP apps)
Toolset: Per-task
Protocol: OpenAI Function Calling
Environment: Stateful (container)
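
To make the per-task toolset and the function-calling protocol concrete, here is a minimal sketch of how a subset of MCP tool descriptors might be exposed to a model in the OpenAI function-calling format. It assumes each MCP tool is a dict with `name`, `description`, and `inputSchema` fields, as in MCP tool listings. The helper names (`mcp_tool_to_openai`, `build_task_toolset`) and the example tools are hypothetical and are not taken from Toolathlon itself.

```python
from __future__ import annotations

from typing import Any


def mcp_tool_to_openai(tool: dict[str, Any]) -> dict[str, Any]:
    """Wrap one MCP tool descriptor in the OpenAI function-calling tool format."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool.get("description", ""),
            # MCP carries a JSON Schema under `inputSchema`;
            # OpenAI function calling expects it under `parameters`.
            "parameters": tool.get(
                "inputSchema", {"type": "object", "properties": {}}
            ),
        },
    }


def build_task_toolset(
    pool: list[dict[str, Any]], allowed_names: list[str]
) -> list[dict[str, Any]]:
    """Select the tools a single task may use and convert them (hypothetical helper)."""
    allowed = set(allowed_names)
    return [mcp_tool_to_openai(t) for t in pool if t["name"] in allowed]


# Hypothetical two-tool pool, for illustration only.
EXAMPLE_POOL = [
    {
        "name": "read_file",
        "description": "Read a text file from the container.",
        "inputSchema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
    {
        "name": "send_email",
        "description": "Send an email.",
        "inputSchema": {
            "type": "object",
            "properties": {"to": {"type": "string"}, "body": {"type": "string"}},
            "required": ["to", "body"],
        },
    },
]

if __name__ == "__main__":
    # A task that is only allowed to read files sees a one-tool toolset.
    for tool in build_task_toolset(EXAMPLE_POOL, ["read_file"]):
        print(tool["function"]["name"])
```

Under this sketch, each task would pass its own list of allowed tool names, so the model only ever sees the tools that belong to that task rather than the full pool.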
Performance (Success Rate %)
Qwen3-8B (base): 0.9
EnvScaler-8B (best 8B baseline): 2.2
DIVE-8B (SFT): 4.7
DIVE-8B (RL): 8.3
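
The gaps between the reported success rates can be checked with a few lines of arithmetic. The sketch below uses only the four numbers reported above. The helper name `point_gain` is hypothetical, and differences are expressed in percentage points, rounded to one decimal to match the precision of the reported figures.

```python
# Success rates (%) as reported in the performance figures above.
success_rate = {
    "Qwen3-8B (base)": 0.9,
    "EnvScaler-8B": 2.2,
    "DIVE-8B (SFT)": 4.7,
    "DIVE-8B (RL)": 8.3,
}


def point_gain(model: str, reference: str) -> float:
    """Absolute gain of `model` over `reference`, in percentage points."""
    return round(success_rate[model] - success_rate[reference], 1)


if __name__ == "__main__":
    for model in ("DIVE-8B (SFT)", "DIVE-8B (RL)"):
        for reference in ("Qwen3-8B (base)", "EnvScaler-8B"):
            print(f"{model} vs {reference}: +{point_gain(model, reference)} points")
```

On these figures, DIVE-8B (RL) is 7.4 points above the Qwen3-8B base and 6.1 points above EnvScaler-8B, while the RL stage adds 3.6 points on top of SFT.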