HLE (Humanity's Last Exam)

L2 OOD — General Tools TaskSet

Extremely challenging expert-level questions spanning science, math, humanities, and other fields. Designed to be the hardest public benchmark for LLMs, it tests the upper bound of agentic, tool-augmented reasoning.

Tool Pool: Search / Browse

Toolset: Uniform

Protocol: OpenAI Function Calling

Environment: Stateless
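
As an illustration of the configuration above, here is a minimal sketch of how a Search / Browse tool pool might be declared in the OpenAI function-calling format. The tool names (`search`, `browse`), their parameters, and the `parse_tool_call` helper are assumptions made for this example; the card does not specify the actual schemas.

```python
import json

# Hypothetical tool schemas: names, descriptions, and parameters are
# illustrative assumptions, not the benchmark's actual definitions.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search",
        "description": "Run a web search and return result snippets.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."}
            },
            "required": ["query"],
        },
    },
}

BROWSE_TOOL = {
    "type": "function",
    "function": {
        "name": "browse",
        "description": "Fetch a page and return its text content.",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string", "description": "Page to fetch."}
            },
            "required": ["url"],
        },
    },
}

# "Uniform" is read here as: every task sees the same tool list.
TOOLS = [SEARCH_TOOL, BROWSE_TOOL]


def parse_tool_call(tool_call: dict) -> tuple[str, dict]:
    """Return (function name, decoded arguments) from one tool-call object.

    In the function-calling format the arguments arrive as a JSON string,
    so they must be decoded before dispatch.
    """
    fn = tool_call["function"]
    return fn["name"], json.loads(fn["arguments"])


# Example tool call as a model might emit it (hand-written for illustration).
example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "search", "arguments": '{"query": "HLE benchmark"}'},
}
name, args = parse_tool_call(example_call)
```

Because the environment is stateless, a dispatcher built on `parse_tool_call` would not need to carry anything between calls beyond the conversation itself.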

Performance (Success Rate %)

Qwen3-8B (base): 6.4
WebExplorer-8B: 17.3
DIVE-8B (SFT): 13.8
DIVE-8B (RL): 17.8
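
To make the gaps between these success rates explicit, the short sketch below computes each model's gain over the Qwen3-8B base in percentage points. The numbers are copied from the figures above; the variable names are only for this example.

```python
# Success rates (%) as reported above.
scores = {
    "Qwen3-8B (base)": 6.4,
    "WebExplorer-8B": 17.3,
    "DIVE-8B (SFT)": 13.8,
    "DIVE-8B (RL)": 17.8,
}

base = scores["Qwen3-8B (base)"]

# Absolute gain over the base model, in percentage points.
gains = {model: round(score - base, 1) for model, score in scores.items()}

# Margin of DIVE-8B (RL) over WebExplorer-8B, in percentage points.
rl_margin = round(scores["DIVE-8B (RL)"] - scores["WebExplorer-8B"], 1)

for model, gain in gains.items():
    print(f"{model}: +{gain} points over base")
print(f"DIVE-8B (RL) vs WebExplorer-8B: +{rl_margin} points")
```

On these figures, DIVE-8B (SFT) gains 7.4 points over the base, DIVE-8B (RL) gains 11.4 points, and the RL model leads WebExplorer-8B by 0.5 points.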

[Chart legend: Base, Best 8B Baseline, DIVE (SFT), DIVE (RL)]