Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks.
We propose DIVE, an evidence-driven recipe that inverts the usual synthesis order: it executes diverse, real-world tools first and reverse-derives tasks strictly entailed by the resulting traces. DIVE scales structural diversity along two axes, tool-pool coverage and per-task toolset variety, over a pool of 373 tools spanning five domains.
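The inverted, trace-first recipe can be illustrated with a minimal sketch. All function names here (`sample_toolset`, `execute`, `derive_task`) are hypothetical placeholders for illustration, not the paper's actual pipeline or API; the stubbed execution stands in for calling real tools.

```python
import random

def sample_toolset(tool_pool, k):
    """Draw a per-task toolset of size k, scaling per-task toolset variety."""
    return random.sample(tool_pool, k)

def execute(toolset):
    """Run the tools first, producing an execution trace (stubbed here)."""
    return [(tool, f"result_of_{tool}") for tool in toolset]

def derive_task(trace):
    """Reverse-derive a task description strictly entailed by the trace."""
    tools_used = [tool for tool, _ in trace]
    return "Task requiring calls to: " + ", ".join(tools_used)

# Toy pool standing in for the 373 real tools across five domains.
tool_pool = [f"tool_{i}" for i in range(10)]
trace = execute(sample_toolset(tool_pool, k=3))
task = derive_task(trace)
```

The key design point is the ordering: because the task is derived from an already-observed trace, every synthesized task is grounded in tool behavior that actually occurred, rather than hoping a pre-written task is executable.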
Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL examples) yields an average gain of +22 points across nine OOD benchmarks and outperforms the strongest 8B baseline by +68%. Scaling diversity consistently beats scaling quantity, even with 4× less data.