GAIA

L2 OOD — General Tools TaskSet

General AI Assistant benchmark with 103 text-only validation tasks spanning multi-hop reasoning, web search, and information synthesis. Tests broad generalization to unseen task distributions using general-purpose Search/Browse tools.

Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
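To make the protocol and environment fields concrete, here is a minimal sketch of how a Search/Browse tool pool might be declared in the OpenAI function-calling schema and served by a stateless dispatcher. The tool names `search` and `browse`, their parameters, and the stub handlers are assumptions for illustration only; the card does not specify them, and the sketch is not the benchmark's actual harness.

```python
import json

# Hypothetical tool schemas in the OpenAI function-calling format.
# Tool names, descriptions, and parameters are assumptions, not taken from the card.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Run a web search and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Fetch a page and return its text content.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]


def search(query: str) -> str:
    # Placeholder: a real tool would call a search backend here.
    return f"[stub] results for: {query}"


def browse(url: str) -> str:
    # Placeholder: a real tool would fetch and clean the page here.
    return f"[stub] page text from: {url}"


HANDLERS = {"search": search, "browse": browse}


def dispatch(name: str, arguments: str) -> str:
    """Execute one tool call; `arguments` is a JSON string, as in function calling.

    Stateless: the result depends only on this call's name and arguments,
    so nothing carries over between calls.
    """
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    return HANDLERS[name](**json.loads(arguments))


print(dispatch("search", json.dumps({"query": "GAIA benchmark"})))
```

In this sketch the same `TOOLS` list would be passed to the model for every task, which is one plausible reading of a "Uniform" toolset.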

Performance (Success Rate %)

Model             Role (chart legend)   Success Rate (%)
Qwen3-8B (base)   Base                  22.4
WebExplorer-8B    Best 8B Baseline      50.0
DIVE-8B (SFT)     DIVE (SFT)            49.3
DIVE-8B (RL)      DIVE (RL)             61.2
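The gaps between these numbers are easier to compare as percentage-point differences. The short sketch below computes them from the four success rates reported above; the helper name `gain` is an illustrative choice, and no figures beyond those in the card are used.

```python
# Success rates (%) as reported in the card above.
success_rate = {
    "Qwen3-8B (base)": 22.4,
    "WebExplorer-8B": 50.0,
    "DIVE-8B (SFT)": 49.3,
    "DIVE-8B (RL)": 61.2,
}


def gain(model: str, reference: str) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(success_rate[model] - success_rate[reference], 1)


# Compare the RL model against every other entry, then SFT against the baseline.
for reference in ("Qwen3-8B (base)", "WebExplorer-8B", "DIVE-8B (SFT)"):
    print(f"DIVE-8B (RL) vs {reference}: {gain('DIVE-8B (RL)', reference):+.1f} pts")
print(f"DIVE-8B (SFT) vs WebExplorer-8B: {gain('DIVE-8B (SFT)', 'WebExplorer-8B'):+.1f} pts")
```

On these figures, DIVE-8B (RL) is 38.8 points above the base model and 11.2 points above WebExplorer-8B, while DIVE-8B (SFT) is 0.7 points below WebExplorer-8B.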