Xbench-DeepSearch

L2 OOD — General Tools TaskSet

A cross-lingual deep-search benchmark (DeepSearch subset) that requires multilingual information retrieval and synthesis across web sources.

Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
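
As a rough illustration of what a Search / Browse tool pool can look like under the OpenAI function-calling protocol, here is a minimal sketch. The tool names (`search`, `browse`), their parameters, and the helper `parse_tool_call` are assumptions made for this example; the benchmark's actual schemas are not given on this page.

```python
import json

# Hypothetical tool definitions in the OpenAI function-calling format.
# Names, descriptions, and parameters are assumptions for illustration only.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Run a web search and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Fetch a web page and return its text content.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]


def parse_tool_call(name, arguments_json):
    """Check a model-emitted tool call against TOOLS and return (name, args)."""
    specs = {t["function"]["name"]: t["function"] for t in TOOLS}
    if name not in specs:
        raise ValueError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    missing = [k for k in specs[name]["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"missing arguments for {name}: {missing}")
    return name, args
```

Because the environment is stateless, each call in this sketch is validated independently; nothing is carried over between tool invocations.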
Performance (Success Rate %)

Model              Role               Success Rate (%)
Qwen3-8B (base)    Base               24.0
WebExplorer-8B     Best 8B Baseline   53.7
DIVE-8B (SFT)      DIVE (SFT)         50.2
DIVE-8B (RL)       DIVE (RL)          58.1
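
To make the gaps between these results explicit, the short sketch below computes the percentage-point differences from the four success rates reported above. The variable and function names are illustrative; only the scores come from this page.

```python
# Success rates (%) as reported on this page.
scores = {
    "Qwen3-8B (base)": 24.0,
    "WebExplorer-8B": 53.7,
    "DIVE-8B (SFT)": 50.2,
    "DIVE-8B (RL)": 58.1,
}


def gain(model, reference):
    """Percentage-point difference between two reported success rates."""
    return round(scores[model] - scores[reference], 1)


print(gain("DIVE-8B (SFT)", "Qwen3-8B (base)"))  # 26.2
print(gain("DIVE-8B (RL)", "Qwen3-8B (base)"))   # 34.1
print(gain("DIVE-8B (RL)", "DIVE-8B (SFT)"))     # 7.9
print(gain("DIVE-8B (RL)", "WebExplorer-8B"))    # 4.4
```

On these figures, DIVE-8B (SFT) trails WebExplorer-8B by 3.5 points, while DIVE-8B (RL) leads it by 4.4 points.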