HLE (Humanity's Last Exam)
L2
OOD — General Tools
TaskSet
Extremely challenging expert-level questions spanning science, math, humanities, and more. Designed to be the hardest public benchmark for LLMs, testing the upper bound of agentic tool-augmented reasoning.
Tool Pool: Search / Browse
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateless
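As a rough illustration of what this configuration could look like in practice, the sketch below declares two tools in the OpenAI function-calling format and dispatches a tool call without keeping any state between calls. The tool names (`search`, `browse`), their parameters, and the stub handlers are assumptions made for illustration; the source only states the tool pool category, the protocol, and that the environment is stateless.

```python
import json

# Hypothetical tool schemas in the OpenAI function-calling format.
# The names "search" and "browse" and their parameters are assumptions,
# not the benchmark's actual tool definitions.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search",
            "description": "Run a web search and return result snippets.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "browse",
            "description": "Fetch a page and return its text content.",
            "parameters": {
                "type": "object",
                "properties": {"url": {"type": "string"}},
                "required": ["url"],
            },
        },
    },
]


def search(query: str) -> str:
    # Stub handler: a real implementation would call a search backend.
    return f"[stub results for: {query}]"


def browse(url: str) -> str:
    # Stub handler: a real implementation would fetch and clean the page.
    return f"[stub page text for: {url}]"


HANDLERS = {"search": search, "browse": browse}


def dispatch(tool_call: dict) -> dict:
    """Execute one tool call and build the matching tool message.

    Stateless: the result depends only on the arguments in the call,
    not on any earlier call.
    """
    fn = tool_call["function"]
    args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
    result = HANDLERS[fn["name"]](**args)
    return {"role": "tool", "tool_call_id": tool_call["id"], "content": result}


if __name__ == "__main__":
    call = {
        "id": "call_1",
        "type": "function",
        "function": {"name": "search", "arguments": json.dumps({"query": "HLE"})},
    }
    print(dispatch(call))
```

Because the handlers hold no state, repeating the same call yields the same tool message, which is one plausible reading of the "Stateless" environment label.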
Performance (Success Rate %)
Systems compared: Base, Best 8B Baseline, DIVE (SFT), DIVE (RL)