HLE (Humanity's Last Exam)

L2 OOD — General Tools TaskSet

Extremely challenging expert-level questions spanning science, math, humanities, and more. Designed to be the hardest public benchmark for LLMs, testing the upper bound of agentic tool-augmented reasoning.

Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
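
As an illustration of the configuration above, the following is a minimal sketch of how a Search / Browse tool pool might be declared in the OpenAI function-calling format and served by a stateless dispatcher. The tool names (search, browse), their parameters, and the stub handlers are assumptions made for this example; the benchmark's actual tool definitions are not given here.

```python
import json

# Hypothetical tool schemas in the OpenAI function-calling format.
# Names, descriptions, and parameters are assumptions for illustration.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Run a web search and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Fetch a page and return its text content.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]


def search(query: str) -> str:
    # Stub handler: a real environment would call a search backend here.
    return f"[stub results for: {query}]"


def browse(url: str) -> str:
    # Stub handler: a real environment would fetch and clean the page here.
    return f"[stub page text for: {url}]"


HANDLERS = {"search": search, "browse": browse}


def dispatch(name: str, arguments_json: str) -> str:
    """Execute one tool call. Stateless: the output depends only on the
    tool name and its JSON-encoded arguments, never on earlier calls."""
    if name not in HANDLERS:
        return json.dumps({"error": f"unknown tool: {name}"})
    args = json.loads(arguments_json)
    return json.dumps({"result": HANDLERS[name](**args)})
```

Because the dispatcher keeps no state between calls, the same tool call always returns the same kind of result regardless of what the model requested earlier, which is what the "Stateless" environment setting suggests.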
Performance (Success Rate %)
Qwen3-8B (base): 6.4
WebExplorer-8B (best 8B baseline): 17.3
DIVE-8B (SFT): 13.8
DIVE-8B (RL): 17.8