DIVE-Eval

L1 In-Distribution

800 tasks whose seed concepts are disjoint from those seen in training but which draw on the same 373-tool pool. Each task samples a novel subset of 15–50 tools. This tier evaluates whether the agent generalizes to new seed concepts and new toolset combinations within the training distribution.
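The per-task toolset sampling described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual procedure: the uniform sampling, the `tool_000`-style names, and the `sample_novel_toolset` helper are all assumptions. Only the pool size (373) and the subset range (15–50) come from the text.

```python
import random

POOL_SIZE = 373                  # pool size stated in the paragraph above
MIN_TOOLS, MAX_TOOLS = 15, 50    # per-task toolset size range from the text


def sample_novel_toolset(pool, seen, rng):
    """Draw a subset of 15-50 distinct tools that has not been drawn before.

    Uniform sampling is an assumption; the source does not say how
    subsets are chosen.
    """
    while True:
        k = rng.randint(MIN_TOOLS, MAX_TOOLS)      # inclusive on both ends
        subset = frozenset(rng.sample(pool, k))    # sampling without replacement
        if subset not in seen:                     # enforce a "novel" combination
            seen.add(subset)
            return sorted(subset)


# Placeholder tool names; the real tool identifiers are not given in the text.
pool = [f"tool_{i:03d}" for i in range(POOL_SIZE)]
rng = random.Random(0)
seen = set()
toolsets = [sample_novel_toolset(pool, seen, rng) for _ in range(800)]
```

Tracking previously drawn subsets in `seen` is one simple way to guarantee that no two tasks share an identical toolset.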

Tool Pool: 373 tools (General + Expert)

Toolset: Per-task (15–50 tools)

Protocol: OpenAI Function Calling

Environment: Stateless
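To make the protocol and environment settings concrete, here is a small sketch of a tool described in the OpenAI function-calling `tools` format, together with a stateless dispatcher. The `convert_temperature` tool, the `REGISTRY` mapping, and the `execute_tool_call` helper are hypothetical and not part of the benchmark. The reading of "stateless" as "each call's result depends only on its own arguments" is also an assumption.

```python
import json

# Hypothetical tool, described in the function-calling "tools" format:
# a name, a description, and a JSON Schema for the parameters.
TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "convert_temperature",
        "description": "Convert a temperature from Celsius to Fahrenheit.",
        "parameters": {
            "type": "object",
            "properties": {
                "celsius": {
                    "type": "number",
                    "description": "Temperature in degrees Celsius.",
                }
            },
            "required": ["celsius"],
        },
    },
}


def convert_temperature(celsius):
    """Pure function: no state is kept between calls."""
    return {"fahrenheit": celsius * 9 / 5 + 32}


# Maps tool names to their implementations (illustrative only).
REGISTRY = {"convert_temperature": convert_temperature}


def execute_tool_call(name, arguments_json):
    """Run one tool call; arguments arrive as a JSON-encoded string."""
    args = json.loads(arguments_json)
    return json.dumps(REGISTRY[name](**args))


result = execute_tool_call("convert_temperature", '{"celsius": 100}')
```

Because the dispatcher keeps no state between calls, repeating a call with the same arguments returns the same result.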

Performance (Success Rate %)

Qwen3-8B (base): 13.0
WebExplorer-8B (best 8B baseline): 19.1
DIVE-8B (SFT): 35.4
DIVE-8B (RL): 42.5
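The gaps between these success rates can be worked out with a few lines of arithmetic. The sketch below uses only the four reported numbers; the variable names are illustrative.

```python
# Success rates (%) as reported above.
scores = {
    "Qwen3-8B (base)": 13.0,
    "WebExplorer-8B": 19.1,
    "DIVE-8B (SFT)": 35.4,
    "DIVE-8B (RL)": 42.5,
}

base = scores["Qwen3-8B (base)"]

# Absolute gain over the base model, in percentage points.
# Rounding to one decimal removes floating-point noise.
gain_over_base = {model: round(s - base, 1) for model, s in scores.items()}

# Gain of the RL model over the strongest non-DIVE 8B baseline.
gain_over_baseline = round(scores["DIVE-8B (RL)"] - scores["WebExplorer-8B"], 1)

# Additional gain contributed by RL on top of SFT.
rl_over_sft = round(scores["DIVE-8B (RL)"] - scores["DIVE-8B (SFT)"], 1)
```

On these numbers, DIVE-8B (SFT) is 22.4 points above the base model and DIVE-8B (RL) is 29.5 points above it. The RL model is 23.4 points above WebExplorer-8B.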