
GAIA

L2 OOD — General Tools TaskSet

General AI Assistant benchmark with 103 text-only validation tasks spanning multi-hop reasoning, web search, and information synthesis. Tests broad generalization to unseen task distributions using general-purpose Search/Browse tools.

Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
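
As a rough illustration of the setup listed above, the sketch below declares Search and Browse tools in the OpenAI function-calling schema and dispatches a tool call without keeping any state between calls. The tool names (`search`, `browse`), their parameters, the `dispatch` helper, and the stub handlers are assumptions made for this example; the card does not specify the actual tool definitions.

```python
import json

# Hypothetical tool definitions in the OpenAI function-calling format.
# Names, descriptions, and parameters are illustrative assumptions.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Run a web search and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
            },
            "required": ["query"],
        },
    },
}

BROWSE_TOOL = {
    "type": "function",
    "function": {
        "name": "browse",
        "description": "Fetch a web page and return its text content.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Page URL."},
            },
            "required": ["url"],
        },
    },
}

# The same tool list would be offered on every task (a uniform toolset).
TOOLS = [SEARCH_TOOL, BROWSE_TOOL]


def dispatch(tool_call: dict, handlers: dict) -> str:
    """Stateless dispatch: decode the JSON arguments and call the handler.

    Nothing is stored between calls, so each call depends only on its input.
    """
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    return handlers[name](**args)


# Stub handlers standing in for real search and browse backends.
HANDLERS = {
    "search": lambda query: f"results for: {query}",
    "browse": lambda url: f"page text from: {url}",
}

# A tool call shaped like the one a model emits under this protocol.
example_call = {
    "function": {"name": "search", "arguments": json.dumps({"query": "GAIA benchmark"})}
}
print(dispatch(example_call, HANDLERS))
```

In a real evaluation harness the stub handlers would be replaced by actual search and page-fetching backends, and `TOOLS` would be passed to the model along with the task prompt.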
Performance (Success Rate %)
Qwen3-8B (base): 22.4
WebExplorer-8B (best 8B baseline): 50.0
DIVE-8B (SFT): 49.3
DIVE-8B (RL): 61.2
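
To make the gaps between these success rates explicit, the short sketch below computes percentage-point differences from the four reported numbers. The `SUCCESS_RATE` dictionary and the `gain` helper are names introduced here for illustration only; the values are copied from the figures above.

```python
# Success rates (%) as reported on this card.
SUCCESS_RATE = {
    "Qwen3-8B (base)": 22.4,
    "WebExplorer-8B": 50.0,
    "DIVE-8B (SFT)": 49.3,
    "DIVE-8B (RL)": 61.2,
}


def gain(model: str, reference: str) -> float:
    """Percentage-point difference between two models, rounded to 0.1."""
    return round(SUCCESS_RATE[model] - SUCCESS_RATE[reference], 1)


print(gain("DIVE-8B (SFT)", "Qwen3-8B (base)"))  # 26.9
print(gain("DIVE-8B (RL)", "Qwen3-8B (base)"))   # 38.8
print(gain("DIVE-8B (RL)", "DIVE-8B (SFT)"))     # 11.9
print(gain("DIVE-8B (RL)", "WebExplorer-8B"))    # 11.2
```

On these numbers, the SFT model sits 0.7 points below WebExplorer-8B, while the RL model is 11.2 points above it.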