SWE-bench Verified

L3 OOD: Specialized Tools (Task / Pool / Set / Env)

Software engineering benchmark requiring agents to resolve real GitHub issues by interacting with containerized codebases using Bash, file search, and code editor tools in a stateful Docker environment.

Tool Pool: Bash / Search / Editor / Finish
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateful (Docker)
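
As a rough illustration of how a Bash / Search / Editor / Finish tool pool might be declared under the OpenAI function-calling protocol, here is a minimal sketch. The outer layout (`type`, `function`, `parameters` as a JSON Schema object) follows the function-calling format. The tool names, descriptions, parameter names, and the `make_tool` and `missing_arguments` helpers are assumptions made for this example, not the benchmark's actual definitions.

```python
import json


def make_tool(name, description, properties, required):
    """Build one tool entry in the OpenAI function-calling layout."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool pool: names and parameters are illustrative assumptions.
TOOLS = [
    make_tool("bash", "Run a shell command inside the task container.",
              {"command": {"type": "string"}}, ["command"]),
    make_tool("search", "Search repository files for a pattern.",
              {"pattern": {"type": "string"}, "path": {"type": "string"}},
              ["pattern"]),
    make_tool("editor", "Replace a span of text in a file.",
              {"path": {"type": "string"},
               "old_text": {"type": "string"},
               "new_text": {"type": "string"}},
              ["path", "old_text", "new_text"]),
    make_tool("finish", "Signal that the agent considers the issue resolved.",
              {"summary": {"type": "string"}}, []),
]


def missing_arguments(tool_name, arguments_json):
    """Return the required argument names absent from a tool call."""
    spec = next(t["function"] for t in TOOLS
                if t["function"]["name"] == tool_name)
    args = json.loads(arguments_json)
    return [key for key in spec["parameters"]["required"] if key not in args]
```

A harness could call `missing_arguments("editor", '{"path": "a.py"}')` to reject a malformed call before anything runs in the container. Because the environment is stateful, each accepted call would act on whatever state earlier calls left behind.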
Performance (Success Rate %)
Qwen3-8B (base): 10.8
SWE-Dev-8B (best 8B baseline): 19.5
DIVE-8B (SFT): 13.2
DIVE-8B (RL): 18.3
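
To make the gaps between these numbers easier to read, the sketch below computes each model's absolute gain over the Qwen3-8B base in percentage points. It uses only the four success rates reported above. The `gain_over_base` helper is a hypothetical name introduced for this example.

```python
# Success rates (%) as reported in the performance figures above.
success_rate = {
    "Qwen3-8B (base)": 10.8,
    "SWE-Dev-8B": 19.5,
    "DIVE-8B (SFT)": 13.2,
    "DIVE-8B (RL)": 18.3,
}


def gain_over_base(scores, base="Qwen3-8B (base)"):
    """Return each model's absolute gain over the base model, in points."""
    base_score = scores[base]
    return {
        name: round(score - base_score, 1)
        for name, score in scores.items()
        if name != base
    }


gains = gain_over_base(success_rate)
for name, delta in gains.items():
    print(f"{name}: +{delta} points over base")
```

On these figures, DIVE-8B (RL) gains 7.5 points over the base model and sits 1.2 points below SWE-Dev-8B.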