DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Aili Chen♠♣, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du♠♣, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao♠†
Fudan University    MiniMax    Independent
Contact: {alchen20, shawyh}@fudan.edu.cn

Abstract

Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks.

We propose DIVE, an evidence-driven recipe that inverts synthesis order — executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces. DIVE scales structural diversity along two axes: tool-pool coverage and per-task toolset variety, across 373 tools in five domains.

Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves the base model by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68%. Diversity scaling consistently outperforms quantity scaling, even with 4× less data.


Figure 1. Top: a fixed toolset limits diversity. Middle: simulated tools risk unverifiability. Bottom: DIVE's evidence-first synthesis. Radar chart: gray = base model; blue = deep-research data; purple = DIVE.

About DIVE

Three diverse and decoupled resource pools:

Tool Pool

373 validated tools across 5 domains, each typed as Retrieval or Processing, collected via a Crawl–Validate pipeline with per-tool unit tests.
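The Validate stage can be pictured with a minimal sketch: a crawled tool enters the pool only if its unit test runs without error and passes. The tool names and the per-tool `unit_test` convention below are illustrative assumptions, not DIVE's actual interface.

```python
# Hypothetical sketch of the Validate stage of the Crawl-Validate pipeline.
# Each candidate maps a name to (tool_fn, unit_test); only tools whose unit
# test executes cleanly and returns True enter the validated pool.

def make_pool(candidates):
    """Filter crawled tools down to those that pass their unit test."""
    validated = {}
    for name, (fn, unit_test) in candidates.items():
        try:
            if unit_test(fn):
                validated[name] = fn
        except Exception:
            continue  # crawled tool is broken or its test crashes; drop it
    return validated

candidates = {
    "add": (lambda a, b: a + b, lambda f: f(2, 3) == 5),
    "broken_div": (lambda a, b: a / 0, lambda f: f(4, 2) == 2),  # raises, dropped
}
pool = make_pool(candidates)
```

Only `add` survives; `broken_div` raises inside its unit test and is silently discarded, mirroring how invalid crawled tools would never reach the 373-tool pool.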

Seed Pool

~5k entity seeds per domain, drawn from Wikipedia, PubMed, NCBI, and stock exchanges.

Exemplar Pool

Query-only exemplars with structural priors and implicit tool-use patterns.

Evidence Collection → Task Derivation loop:

  1. Config Sampling — sample a seed $S$, a toolset $\mathcal{T}$ (15–50 tools), and exemplars $\mathcal{X}$ (3–5).
  2. Evidence Collection — execute real tools: $E_k = \mathcal{F}_{\text{col}}(Q_{k-1}, E_{k-1} \mid \mathcal{T})$
  3. Task Derivation — reverse-derive: $(Q_k, A_k) = \mathcal{F}_{\text{der}}(Q_{k-1}, E_k \mid \mathcal{X})$
  4. Iterate $K$ times.
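The loop above can be sketched as follows. `run_tools` and `derive_task` are hypothetical stand-ins for $\mathcal{F}_{\text{col}}$ and $\mathcal{F}_{\text{der}}$, which are LLM-driven in the paper; here they return placeholder strings so the control flow is runnable.

```python
from dataclasses import dataclass, field

@dataclass
class SynthesisState:
    query: str                                     # Q_{k-1}
    evidence: list = field(default_factory=list)   # accumulated evidence

def run_tools(state, toolset):
    """Stand-in for F_col: execute tools against the current query."""
    return [f"{tool}({state.query})" for tool in toolset]  # placeholder traces

def derive_task(state, evidence, exemplars):
    """Stand-in for F_der: reverse-derive a (Q_k, A_k) pair from evidence."""
    question = f"task over {len(evidence)} evidence items"
    answer = evidence[-1]  # a real answer would be entailed by the full trace
    return question, answer

def synthesize(seed, toolset, exemplars, K=3):
    state = SynthesisState(query=seed)
    for _ in range(K):
        new_evidence = run_tools(state, toolset)              # step 2
        q, a = derive_task(state, new_evidence, exemplars)    # step 3
        state = SynthesisState(query=q,
                               evidence=state.evidence + new_evidence)
    return state

final = synthesize("NVIDIA", ["search", "parse"], ["exemplar-1"], K=3)
```

Each iteration collects evidence before deriving the next task, so every derived question is grounded in an executed trace rather than imagined tool behavior.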

Agentic RL

3.2k frontier tasks, optimized with GRPO under the reward $R = \alpha R_{\text{format}} + R_{\text{correct}}$.
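A minimal sketch of this scalar reward, assuming an `<answer>` tag as the format check and exact match for correctness (both are assumptions; the page does not specify how $R_{\text{format}}$ and $R_{\text{correct}}$ are computed):

```python
def reward(response: str, gold: str, alpha: float = 0.1) -> float:
    """Sketch of R = alpha * R_format + R_correct.

    R_format: 1 if the response wraps its final answer in <answer> tags
    (assumed convention). R_correct: 1 on exact match with gold, else 0.
    """
    has_tags = "<answer>" in response and "</answer>" in response
    r_format = 1.0 if has_tags else 0.0
    if has_tags:
        inner = response.split("<answer>")[1].split("</answer>")[0].strip()
    else:
        inner = response.strip()
    r_correct = 1.0 if inner == gold.strip() else 0.0
    return alpha * r_format + r_correct
```

Weighting the format term by a small $\alpha$ keeps correctness dominant while still shaping the model toward parseable outputs early in RL.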

DIVE framework

Figure 2. DIVE framework: (1) Resource Preparation; (2) Evidence-Driven Synthesis; (3) Agentic Training.

Evaluation Benchmarks

We evaluate across 10 benchmarks organized into three tiers with progressively increasing distribution shift. We characterize five OOD factors relative to DIVE's training data:

  • Task — shifted task distribution
  • Pool — unseen tool pool
  • Set — unseen toolset
  • Proto — non-OpenAI protocol
  • Env — stateful environment
| Tier | Task Family | Benchmark | OOD Factors | Tool Pool | Toolset | Protocol | Env |
|------|-------------|-----------|-------------|-----------|---------|----------|-----|
| L1 | In-Distribution | DIVE-Eval | — | 384 Tools (General + Expert) | Per-task | OpenAI | Stateless |
| L2 | General DeepResearch | GAIA, HLE, BC, XB-DS | Task, Set | Search / Browse | Uniform | OpenAI | Stateless |
| L2 | Domain DeepResearch | FinSearchComp (Global) | Task, Set | Search / Browse / Code | Uniform | OpenAI | Stateless |
| L3 | Financial Specialist | FAB | Task, Pool, Set | EDGAR / Web / Parse | Uniform | OpenAI | Stateless |
| L3 | Medical Specialist | MAB | Task, Pool, Set, Proto, Env | FHIR GET / POST | Uniform | HTTP | Stateful |
| L3 | Software Engineering | SWE-bench | Task, Pool, Set, Env | Bash / Search / Editor | Uniform | OpenAI | Stateful |
| L3 | Zero-Shot Generalist | Toolathlon | Task, Pool, Set, Env | 604 Tools (32 MCP Apps) | Per-task | OpenAI | Stateful |

Main Results

We compare DIVE-8B against two categories: 8B baselines trained on other synthesized data (including task-specific specialists like WebExplorer-8B and generalization-oriented models like EnvScaler-8B), and frontier models (≫8B) including Gemini-3-Pro, Claude-4-Sonnet, and DeepSeek-V3.2. All models are evaluated with temperature=1 and top-p=1. Scores are success rates (%).

Overall comparison across L1–L3 benchmarks. Underline: best overall; bold: best among 8B.

DIVE-Eval is the L1 in-distribution benchmark; GAIA through FSC₃ are L2 (OOD with general tools); FAB through Toolathlon are L3 (OOD with specialized tools).

| Group | Model | DIVE-Eval | GAIA | HLE | BC | XB-DS | FSC₂ | FSC₃ | FAB | MAB | SWE | Toolathlon |
|-------|-------|-----------|------|-----|----|-------|------|------|-----|-----|-----|------------|
| Frontier (≫8B) | Gemini-3-Pro | 45.3 | 80.3 | 42.9 | 49.0 | 76.0 | 70.6 | 52.4 | 39.0 | 74.8 | 76.2 | 36.4 |
| Frontier (≫8B) | Claude-4-Sonnet | 44.8 | 63.7 | 20.8 | 12.8 | 62.2 | 60.2 | 33.3 | 39.0 | 79.3 | 72.7 | 29.9 |
| Frontier (≫8B) | Gemini-2.5-Pro | 29.1 | 60.2 | 28.4 | 9.9 | 56.0 | 44.5 | 27.4 | 24.0 | 65.1 | 59.6 | 10.5 |
| Frontier (≫8B) | DeepSeek-V3.2-Exp | 40.4 | 61.0 | 17.9 | 40.1 | 67.2 | 61.3 | 27.4 | 26.0 | 67.3 | 67.8 | 20.1 |
| Frontier (≫8B) | Kimi-K2-0905 | 32.9 | 60.0 | 26.9 | 14.1 | 61.0 | 47.1 | 10.7 | 28.0 | 61.2 | 69.2 | 13.0 |
| Frontier (≫8B) | GPT-OSS-120B | 40.5 | 66.0 | 19.0 | 27.0 | 69.5 | 61.0 | 22.0 | 34.0 | 64.3 | 62.0 | 9.8 |
| 8B Baselines | WebExplorer-8B | 19.1 | 50.0 | 17.3 | 15.7 | 53.7 | 35.9 | 18.1 | 4.0 | 17.8 | 7.0 | 0.3 |
| 8B Baselines | SWE-Dev-8B | 13.8 | 23.2 | 6.9 | 1.6 | 31.6 | 30.5 | 3.6 | 3.0 | 14.2 | 19.5 | 0.0 |
| 8B Baselines | EnvScaler-8B | 15.4 | 25.8 | 2.8 | 1.7 | 45.7 | 40.7 | 10.8 | 14.0 | 56.6 | 11.5 | 2.2 |
| Ours | Qwen3-8B (base) | 13.0 | 22.4 | 6.4 | 1.3 | 24.0 | 28.6 | 7.1 | 2.0 | 38.4 | 10.8 | 0.9 |
| Ours | DIVE-8B (SFT) | 35.4 | 49.3 | 13.8 | 12.9 | 50.2 | 62.1 | 33.0 | 28.0 | 50.2 | 13.2 | 4.7 |
| Ours | DIVE-8B (RL) | 42.5 | 61.2 | 17.8 | 16.4 | 58.1 | 67.3 | 37.3 | 34.0 | 57.3 | 18.3 | 8.3 |
Key Findings
  • Robust OOD generalization: DIVE-8B (RL) improves by +22 avg points across 9 OOD benchmarks, outperforming the strongest 8B baseline by +68%.
  • Competitive with frontier models: Despite an 8B backbone, DIVE matches or approaches models 10–100× larger on deep-research benchmarks (e.g., GAIA 61.2, FSC₂ 67.3) and challenging specialized tasks (e.g., FAB 34.0, MAB 57.3).
  • Generalist beats specialists: Without task-specific training, DIVE surpasses specialist agents on their home benchmarks (GAIA: 61.2 vs. 50.0 for WebExplorer-8B), while specialists exhibit negative transfer on unseen domains.
  • Zero-shot tool use: On Toolathlon — a stringent benchmark with per-task MCP toolsets and stateful environments — DIVE improves from near-zero (0.9) to 8.3, approaching GPT-OSS-120B (9.8) and Gemini-2.5-Pro (10.5).

Scaling Analysis

(a) Diversity-only vs. Quantity-only

(b) Variety-only vs. Pool-Exp+Variety

(c) All-Path Scaling: SFT → RL

Figure 3. (a) Diversity scaling outperforms quantity scaling with 4× less data. (b) Pool expansion yields faster gains. (c) RL amplifies the diversity-scaling trend across 24 paths.

Key Takeaways
  • Diversity > Quantity: Scaling structural diversity yields stronger gains than scaling data volume — even with 4× less data.
  • Two-axis scaling: Pool expansion (more tools) and variety scaling (more toolset combinations) are complementary; combining both maximizes returns.
  • RL amplifies diversity: Reinforcement learning consistently magnifies the advantage of diversity-rich SFT checkpoints across all 24 scaling paths.

Structural Diversity Analysis

We quantify diversity at three levels: tool-pool coverage, toolset variety, and tool-use pattern complexity.

| Metric | Gen-DR | DIVE | Gain |
|--------|--------|------|------|
| Tools Covered | 2 | 373 | +186× |
| Unique Toolsets | 1 | 46,398 | +46k× |
| Unique Call Seqs | 1,231 | 25,084 | +20× |
| Unique Call Graphs | 19,442 | 39,810 | +105% |
| R/P Topo Classes | 65 | 153 | +135% |
| Tool Types / Task | 1.71 | 3.26 | +91% |
| Avg OOD (SFT) | 22.51 | 32.15 | +43% |
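Two of these metrics, unique toolsets and unique call sequences, can be computed directly from per-task tool-call traces, as in the short illustration below (the trace data is made up; tool names are placeholders):

```python
# A toolset ignores order and repetition (the set of tools a task touches);
# a call sequence is the ordered trace itself. Two tasks can share a toolset
# while exercising different sequences.

traces = [
    ["search", "parse", "search"],
    ["search", "parse"],
    ["parse", "search"],          # same toolset as above, different sequence
    ["edgar_filings", "parse"],
]

unique_toolsets = {frozenset(trace) for trace in traces}
unique_call_seqs = {tuple(trace) for trace in traces}
```

Here four traces collapse to two toolsets but remain four distinct call sequences, which is why the toolset and sequence counts in the table scale so differently.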
Figure 4. (a) R/P topology density; (b) tool call frequency.

Figure 5. RL training dynamics: accuracy, tool calls, unique graphs, and R/P topologies over 100 steps.

Key Takeaways
  • Massive diversity gains: DIVE covers 373 tools (vs. 2) and generates 46k unique toolsets — orders-of-magnitude more structural diversity than quantity-scaled baselines.
  • RL discovers novel strategies: During RL training, unique call-graph structures increase steadily — the model learns new tool-composition patterns beyond SFT demonstrations.
  • Topology shift: DIVE shifts the R/P topology distribution from retrieval-dominated to a balanced retrieval-processing mix, enabling more complex reasoning chains.
  • Long-tail coverage: DIVE activates the full 373-tool pool, maintaining meaningful frequency even for rare domain-specific tools.

BibTeX

@inproceedings{chen2026dive,
  title={{DIVE}: Scaling Diversity in Agentic Task Synthesis
         for Generalizable Tool Use},
  author={},
  booktitle={},
  year={2026}
}