
BrowseComp

L2 OOD — General Tools TaskSet

Web browsing comprehension benchmark requiring agents to navigate, extract, and reason over web content to answer complex questions.

Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
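To illustrate what a search/browse tool pool looks like under OpenAI function calling, here is a minimal sketch. It assumes two tools named `search` and `browse` with the parameters shown. These names, parameters, and the `dispatch` helper are hypothetical and are not the benchmark's actual tool signatures. The schema follows the OpenAI `tools` format. Dispatch is stateless: each call is handled on its own, with no memory carried between calls.

```python
import json

# Hypothetical tool definitions in the OpenAI function-calling "tools" format.
# Tool names and parameters are illustrative assumptions, not BrowseComp's real tools.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Run a web search and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query text."}
                },
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Fetch a web page and return content relevant to a goal.",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "Page URL to open."},
                    "goal": {"type": "string", "description": "What to extract from the page."},
                },
                "required": ["url"],
            },
        },
    },
]


def dispatch(tool_call: dict, handlers: dict) -> str:
    """Route one model-issued tool call to its handler (hypothetical helper).

    Stateless: nothing is stored between calls, so each result depends
    only on the call's own arguments.
    """
    name = tool_call["function"]["name"]
    # In function calling, arguments arrive as a JSON-encoded string.
    args = json.loads(tool_call["function"]["arguments"])
    return handlers[name](**args)


if __name__ == "__main__":
    # Stub handlers stand in for real search/browse backends.
    handlers = {
        "search": lambda query: f"results for {query}",
        "browse": lambda url, goal="": f"content of {url}",
    }
    call = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "search", "arguments": json.dumps({"query": "BrowseComp"})},
    }
    print(dispatch(call, handlers))  # results for BrowseComp
```

Keeping the environment stateless makes every tool call reproducible from its arguments alone. An agent must therefore carry any accumulated evidence in its own conversation context.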
Performance (Success Rate, %)
Qwen3-8B (base model): 1.3
WebExplorer-8B (best 8B baseline): 15.7
DIVE-8B (SFT): 12.9
DIVE-8B (RL): 16.4