DIVE-Eval

L1 In-Distribution

L1 comprises 800 tasks whose seed concepts are disjoint from those seen in training but which draw on the same 373-tool pool. Each task samples a novel subset of 15–50 tools. This level evaluates whether the agent generalizes to new seed concepts and new toolset combinations within the training distribution.
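The per-task toolset sampling described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual construction code: the placeholder tool names, the uniform choice of subset size, and the fixed seed are all assumptions made for the example.

```python
import random

POOL_SIZE = 373  # pool size stated in the text; the names below are placeholders
TOOL_POOL = [f"tool_{i:03d}" for i in range(POOL_SIZE)]


def sample_toolset(rng, pool, min_tools=15, max_tools=50):
    """Draw one per-task toolset: a duplicate-free subset of 15-50 tools.

    Assumption: the subset size is drawn uniformly; the source does not
    say how sizes are distributed.
    """
    k = rng.randint(min_tools, max_tools)  # inclusive on both ends
    return rng.sample(pool, k)  # sampling without replacement


rng = random.Random(0)  # fixed seed so the sketch is reproducible
toolsets = [sample_toolset(rng, TOOL_POOL) for _ in range(800)]
```

Each of the 800 entries in `toolsets` is one task's toolset; because sizes and members are drawn independently, most tasks end up with a tool combination no other task has.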

Tool Pool: 373 tools (General + Expert)
Toolset: Per-task (15–50 tools)
Protocol: OpenAI Function Calling
Environment: Stateless
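To make the protocol and environment settings concrete, here is a small sketch of one tool declared in the OpenAI function-calling schema and executed statelessly. The tool itself (`convert_length`), the registry, and the `execute_tool_call` helper are hypothetical and are not taken from the DIVE tool pool. Only the outer `{"type": "function", "function": {...}}` layout with JSON Schema parameters follows the OpenAI format.

```python
import json


# Hypothetical tool, for illustration only.
def convert_length(value: float, from_unit: str, to_unit: str) -> float:
    meters_per_unit = {"m": 1.0, "km": 1000.0, "cm": 0.01}
    return value * meters_per_unit[from_unit] / meters_per_unit[to_unit]


# Tool declaration in the OpenAI function-calling layout.
TOOL_SPEC = {
    "type": "function",
    "function": {
        "name": "convert_length",
        "description": "Convert a length between metric units.",
        "parameters": {
            "type": "object",
            "properties": {
                "value": {"type": "number"},
                "from_unit": {"type": "string", "enum": ["m", "km", "cm"]},
                "to_unit": {"type": "string", "enum": ["m", "km", "cm"]},
            },
            "required": ["value", "from_unit", "to_unit"],
        },
    },
}

REGISTRY = {"convert_length": convert_length}


def execute_tool_call(name: str, arguments_json: str) -> str:
    """Stateless dispatch: the result depends only on the name and arguments."""
    args = json.loads(arguments_json)  # the model emits arguments as a JSON string
    result = REGISTRY[name](**args)
    return json.dumps({"result": result})


reply = execute_tool_call(
    "convert_length", '{"value": 2, "from_unit": "km", "to_unit": "m"}'
)
```

Because no state is kept between calls, repeating the same call returns the same reply, so a task's outcome does not depend on what was called earlier.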
Performance (Success Rate %)

Qwen3-8B (base): 13.0
WebExplorer-8B (best 8B baseline): 19.1
DIVE-8B (SFT): 35.4
DIVE-8B (RL): 42.5
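As a quick reading aid, the gaps between the success rates reported above can be computed directly. The sketch below uses only the four numbers from this section; the dictionary keys and the `gain` helper are illustrative names.

```python
# Success rates (%) as reported in this section.
success_rate = {
    "Qwen3-8B (base)": 13.0,
    "WebExplorer-8B": 19.1,
    "DIVE-8B (SFT)": 35.4,
    "DIVE-8B (RL)": 42.5,
}


def gain(model: str, reference: str) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(success_rate[model] - success_rate[reference], 1)


sft_over_base = gain("DIVE-8B (SFT)", "Qwen3-8B (base)")
rl_over_base = gain("DIVE-8B (RL)", "Qwen3-8B (base)")
rl_over_sft = gain("DIVE-8B (RL)", "DIVE-8B (SFT)")
rl_over_best_baseline = gain("DIVE-8B (RL)", "WebExplorer-8B")
```

These are differences in percentage points, not relative improvements.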