SWE-bench Verified
L3
OOD — Specialized Tools
Task / Pool / Set / Env
Software engineering benchmark requiring agents to resolve real GitHub issues by interacting with containerized codebases using Bash, file search, and code editor tools in a stateful Docker environment.
Tool Pool: Bash / Search / Editor / Finish
Toolset: Uniform
Protocol: OpenAI Function Calling
Environment: Stateful (Docker)
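To make the protocol entry concrete, here is a minimal sketch of how the four tools might be declared in the OpenAI function-calling `tools` format and how a model-issued call could be checked before it is executed. The tool names, parameter names, and the `make_tool` and `parse_tool_call` helpers are illustrative assumptions, not the benchmark's actual schemas.

```python
import json


def make_tool(name, description, properties, required):
    """Build one entry in the OpenAI function-calling `tools` list format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical schemas: names and parameters are assumptions for illustration.
TOOLS = [
    make_tool("bash", "Run a shell command inside the task container.",
              {"command": {"type": "string"}}, ["command"]),
    make_tool("search", "Search repository files for a pattern.",
              {"pattern": {"type": "string"}, "path": {"type": "string"}},
              ["pattern"]),
    make_tool("editor", "Replace a snippet of text in a file.",
              {"path": {"type": "string"},
               "old_text": {"type": "string"},
               "new_text": {"type": "string"}},
              ["path", "old_text", "new_text"]),
    make_tool("finish", "Signal that the agent considers the issue resolved.",
              {"summary": {"type": "string"}}, []),
]


def parse_tool_call(name, arguments_json):
    """Check a call against TOOLS and return its decoded arguments."""
    by_name = {t["function"]["name"]: t["function"] for t in TOOLS}
    if name not in by_name:
        raise ValueError(f"unknown tool: {name}")
    # Function-calling arguments arrive as a JSON-encoded string.
    args = json.loads(arguments_json)
    required = by_name[name]["parameters"]["required"]
    missing = [key for key in required if key not in args]
    if missing:
        raise ValueError(f"missing arguments for {name}: {missing}")
    return args
```

Because the environment is stateful, a harness built along these lines would route each validated `bash` or `editor` call to the same running container, so the effects of earlier calls persist until `finish` is issued.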
Performance (Success Rate %): chart comparing Base, Best 8B Baseline, DIVE (SFT), and DIVE (RL)