DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Aili Chen♠♣, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du♠♣, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao♠†
Fudan University    MiniMax    Independent
Contact: {alchen20, shawyh}@fudan.edu.cn

Abstract

Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks.

We propose DIVE, an evidence-driven recipe that inverts synthesis order — executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces. DIVE scales structural diversity along two axes: tool-pool coverage and per-task toolset variety, across 373 tools in five domains.

Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves the base model by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by +68%. Diversity scaling consistently outperforms quantity scaling, even with 4× less data.


Figure 1. Top: a fixed toolset limits diversity. Middle: simulated tools risk unverifiability. Bottom: DIVE's evidence-first synthesis. Radar chart: gray = base model; blue = deep-research data; purple = DIVE.

About DIVE

Three diverse and decoupled resource pools:

Tool Pool

373 validated tools across 5 domains, each typed as Retrieval or Processing, collected via a Crawl–Validate pipeline with per-tool unit tests.
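The Validate stage can be pictured with a minimal sketch: a crawled tool enters the pool only if its unit test runs without error and passes. The tool names and the per-tool `unit_test` convention below are illustrative assumptions, not DIVE's actual interface.

```python
# Hypothetical sketch of the Validate stage of the Crawl-Validate pipeline.
# Each candidate maps a name to (tool_fn, unit_test); only tools whose unit
# test executes cleanly and returns True enter the validated pool.

def make_pool(candidates):
    """Filter crawled tools down to those that pass their unit test."""
    validated = {}
    for name, (fn, unit_test) in candidates.items():
        try:
            if unit_test(fn):
                validated[name] = fn
        except Exception:
            continue  # crawled tool is broken or its test crashes; drop it
    return validated

candidates = {
    "add": (lambda a, b: a + b, lambda f: f(2, 3) == 5),
    "broken_div": (lambda a, b: a / 0, lambda f: f(4, 2) == 2),  # raises, dropped
}
pool = make_pool(candidates)
```

Only `add` survives; `broken_div` raises inside its unit test and is silently discarded, mirroring how invalid crawled tools would never reach the 373-tool pool.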

Seed Pool

~5k entity seeds per domain, drawn from Wikipedia, PubMed, NCBI, and stock exchanges.

Exemplar Pool

Query-only exemplars with structural priors and implicit tool-use patterns.

Evidence Collection → Task Derivation loop:

  1. Config Sampling — sample a seed $S$, a toolset $\mathcal{T}$ (15–50 tools), and exemplars $\mathcal{X}$ (3–5).
  2. Evidence Collection — execute real tools: $E_k = \mathcal{F}_{\text{col}}(Q_{k-1}, E_{k-1} \mid \mathcal{T})$
  3. Task Derivation — reverse-derive: $(Q_k, A_k) = \mathcal{F}_{\text{der}}(Q_{k-1}, E_k \mid \mathcal{X})$
  4. Iterate $K$ times.
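The loop above can be sketched as follows. `run_tools` and `derive_task` are hypothetical stand-ins for $\mathcal{F}_{\text{col}}$ and $\mathcal{F}_{\text{der}}$, which are LLM-driven in the paper; here they return placeholder strings so the control flow is runnable.

```python
from dataclasses import dataclass, field

@dataclass
class SynthesisState:
    query: str                                     # Q_{k-1}
    evidence: list = field(default_factory=list)   # accumulated evidence

def run_tools(state, toolset):
    """Stand-in for F_col: execute tools against the current query."""
    return [f"{tool}({state.query})" for tool in toolset]  # placeholder traces

def derive_task(state, evidence, exemplars):
    """Stand-in for F_der: reverse-derive a (Q_k, A_k) pair from evidence."""
    question = f"task over {len(evidence)} evidence items"
    answer = evidence[-1]  # a real answer would be entailed by the full trace
    return question, answer

def synthesize(seed, toolset, exemplars, K=3):
    state = SynthesisState(query=seed)
    for _ in range(K):
        new_evidence = run_tools(state, toolset)              # step 2
        q, a = derive_task(state, new_evidence, exemplars)    # step 3
        state = SynthesisState(query=q,
                               evidence=state.evidence + new_evidence)
    return state

final = synthesize("NVIDIA", ["search", "parse"], ["exemplar-1"], K=3)
```

Each iteration collects evidence before deriving the next task, so every derived question is grounded in an executed trace rather than imagined tool behavior.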

Agentic RL

3.2k frontier tasks, optimized with GRPO under the reward $R = \alpha R_{\text{format}} + R_{\text{correct}}$.
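A minimal sketch of this scalar reward, assuming an `<answer>` tag as the format check and exact match for correctness (both are assumptions; the page does not specify how $R_{\text{format}}$ and $R_{\text{correct}}$ are computed):

```python
def reward(response: str, gold: str, alpha: float = 0.1) -> float:
    """Sketch of R = alpha * R_format + R_correct.

    R_format: 1 if the response wraps its final answer in <answer> tags
    (assumed convention). R_correct: 1 on exact match with gold, else 0.
    """
    has_tags = "<answer>" in response and "</answer>" in response
    r_format = 1.0 if has_tags else 0.0
    if has_tags:
        inner = response.split("<answer>")[1].split("</answer>")[0].strip()
    else:
        inner = response.strip()
    r_correct = 1.0 if inner == gold.strip() else 0.0
    return alpha * r_format + r_correct
```

Weighting the format term by a small $\alpha$ keeps correctness dominant while still shaping the model toward parseable outputs early in RL.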

DIVE framework

Figure 2. DIVE framework: (1) Resource Preparation; (2) Evidence-Driven Synthesis; (3) Agentic Training.

Evaluation Benchmarks

We evaluate across 10 benchmarks organized into three tiers with progressively increasing distribution shift. We characterize five OOD factors relative to DIVE's training data:

  • Task — shifted task distribution
  • Pool — unseen tool pool
  • Set — unseen toolset
  • Proto — non-OpenAI protocol
  • Env — stateful environment
| Tier | Task Family | Benchmark | OOD Factors | Tool Pool | Toolset | Protocol | Env |
|------|-------------|-----------|-------------|-----------|---------|----------|-----|
| L1 | In-Distribution | DIVE-Eval | — | 384 Tools (General + Expert) | Per-task | OpenAI | Stateless |
| L2 | General DeepResearch | GAIA, HLE, BC, XB-DS | Task, Set | Search / Browse | Uniform | OpenAI | Stateless |
| L2 | Domain DeepResearch | FinSearchComp (Global) | Task, Set | Search / Browse / Code | Uniform | OpenAI | Stateless |
| L3 | Financial Specialist | FAB | Task, Pool, Set | EDGAR / Web / Parse | Uniform | OpenAI | Stateless |
| L3 | Medical Specialist | MAB | Task, Pool, Set, Proto, Env | FHIR GET / POST | Uniform | HTTP | Stateful |
| L3 | Software Engineering | SWE-bench | Task, Pool, Set, Env | Bash / Search / Editor | Uniform | OpenAI | Stateful |
| L3 | Zero-Shot Generalist | Toolathlon | Task, Pool, Set, Env | 604 Tools (32 MCP Apps) | Per-task | OpenAI | Stateful |

Main Results

We compare DIVE-8B against two categories: 8B baselines trained on other synthesized data (including task-specific specialists like WebExplorer-8B and generalization-oriented models like EnvScaler-8B), and frontier models (≫8B) including Gemini-3-Pro, Claude-4-Sonnet, and DeepSeek-V3.2. All models are evaluated with temperature=1 and top-p=1. Scores are success rates (%).

Overall comparison across L1–L3 benchmarks. Underline: best overall; bold: best among 8B.

DIVE-Eval is the L1 in-distribution benchmark; GAIA through FSC₃ are L2 (OOD with general tools); FAB through Toolathlon are L3 (OOD with specialized tools).

| Group | Model | DIVE-Eval | GAIA | HLE | BC | XB-DS | FSC₂ | FSC₃ | FAB | MAB | SWE | Toolathlon |
|-------|-------|-----------|------|-----|----|-------|------|------|-----|-----|-----|------------|
| Frontier (≫8B) | Gemini-3-Pro | 45.3 | 80.3 | 42.9 | 49.0 | 76.0 | 70.6 | 52.4 | 39.0 | 74.8 | 76.2 | 36.4 |
| Frontier (≫8B) | Claude-4-Sonnet | 44.8 | 63.7 | 20.8 | 12.8 | 62.2 | 60.2 | 33.3 | 39.0 | 79.3 | 72.7 | 29.9 |
| Frontier (≫8B) | Gemini-2.5-Pro | 29.1 | 60.2 | 28.4 | 9.9 | 56.0 | 44.5 | 27.4 | 24.0 | 65.1 | 59.6 | 10.5 |
| Frontier (≫8B) | DeepSeek-V3.2-Exp | 40.4 | 61.0 | 17.9 | 40.1 | 67.2 | 61.3 | 27.4 | 26.0 | 67.3 | 67.8 | 20.1 |
| Frontier (≫8B) | Kimi-K2-0905 | 32.9 | 60.0 | 26.9 | 14.1 | 61.0 | 47.1 | 10.7 | 28.0 | 61.2 | 69.2 | 13.0 |
| Frontier (≫8B) | GPT-OSS-120B | 40.5 | 66.0 | 19.0 | 27.0 | 69.5 | 61.0 | 22.0 | 34.0 | 64.3 | 62.0 | 9.8 |
| 8B Baselines | WebExplorer-8B | 19.1 | 50.0 | 17.3 | 15.7 | 53.7 | 35.9 | 18.1 | 4.0 | 17.8 | 7.0 | 0.3 |
| 8B Baselines | SWE-Dev-8B | 13.8 | 23.2 | 6.9 | 1.6 | 31.6 | 30.5 | 3.6 | 3.0 | 14.2 | 19.5 | 0.0 |
| 8B Baselines | EnvScaler-8B | 15.4 | 25.8 | 2.8 | 1.7 | 45.7 | 40.7 | 10.8 | 14.0 | 56.6 | 11.5 | 2.2 |
| Ours | Qwen3-8B (base) | 13.0 | 22.4 | 6.4 | 1.3 | 24.0 | 28.6 | 7.1 | 2.0 | 38.4 | 10.8 | 0.9 |
| Ours | DIVE-8B (SFT) | 35.4 | 49.3 | 13.8 | 12.9 | 50.2 | 62.1 | 33.0 | 28.0 | 50.2 | 13.2 | 4.7 |
| Ours | DIVE-8B (RL) | 42.5 | 61.2 | 17.8 | 16.4 | 58.1 | 67.3 | 37.3 | 34.0 | 57.3 | 18.3 | 8.3 |
Key Findings
  • Robust OOD generalization: DIVE-8B (RL) improves by +22 avg points across 9 OOD benchmarks, outperforming the strongest 8B baseline by +68%.
  • Competitive with frontier models: Despite an 8B backbone, DIVE matches or approaches models 10–100× larger on deep-research benchmarks (e.g., GAIA 61.2, FSC₂ 67.3) and challenging specialized tasks (e.g., FAB 34.0, MAB 57.3).
  • Generalist beats specialists: Without task-specific training, DIVE surpasses specialist agents on their home benchmarks (GAIA: 61.2 vs. 50.0 for WebExplorer-8B), while specialists exhibit negative transfer on unseen domains.
  • Zero-shot tool use: On Toolathlon — a stringent benchmark with per-task MCP toolsets and stateful environments — DIVE improves from near-zero (0.9) to 8.3, approaching GPT-OSS-120B (9.8) and Gemini-2.5-Pro (10.5).

Scaling Analysis

(a) Diversity-only vs. Quantity-only

(b) Variety-only vs. Pool-Exp+Variety

(c) All-Path Scaling: SFT → RL

Figure 3. (a) Diversity scaling outperforms quantity scaling with 4× less data. (b) Pool expansion yields faster gains. (c) RL amplifies the diversity-scaling trend across 24 paths.

Key Takeaways
  • Diversity > Quantity: Scaling structural diversity yields stronger gains than scaling data volume — even with 4× less data.
  • Two-axis scaling: Pool expansion (more tools) and variety scaling (more toolset combinations) are complementary; combining both maximizes returns.
  • RL amplifies diversity: Reinforcement learning consistently magnifies the advantage of diversity-rich SFT checkpoints across all 24 scaling paths.

Structural Diversity Analysis

We quantify diversity at three levels: tool-pool coverage, toolset variety, and tool-use pattern complexity.

| Metric | Gen-DR | DIVE | Gain |
|--------|--------|------|------|
| Tools Covered | 2 | 373 | +186× |
| Unique Toolsets | 1 | 46,398 | +46k× |
| Unique Call Seqs | 1,231 | 25,084 | +20× |
| Unique Call Graphs | 19,442 | 39,810 | +105% |
| R/P Topo Classes | 65 | 153 | +135% |
| Tool Types / Task | 1.71 | 3.26 | +91% |
| Avg OOD (SFT) | 22.51 | 32.15 | +43% |
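Two of these metrics, unique toolsets and unique call sequences, can be computed directly from per-task tool-call traces, as in the short illustration below (the trace data is made up; tool names are placeholders):

```python
# A toolset ignores order and repetition (the set of tools a task touches);
# a call sequence is the ordered trace itself. Two tasks can share a toolset
# while exercising different sequences.

traces = [
    ["search", "parse", "search"],
    ["search", "parse"],
    ["parse", "search"],          # same toolset as above, different sequence
    ["edgar_filings", "parse"],
]

unique_toolsets = {frozenset(trace) for trace in traces}
unique_call_seqs = {tuple(trace) for trace in traces}
```

Here four traces collapse to two toolsets but remain four distinct call sequences, which is why the toolset and sequence counts in the table scale so differently.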
Figure 4. (a) R/P topology density; (b) tool call frequency.

Figure 5. RL training dynamics: accuracy, tool calls, unique graphs, and R/P topologies over 100 steps.

Key Takeaways
  • Massive diversity gains: DIVE covers 373 tools (vs. 2) and generates 46k unique toolsets — orders-of-magnitude more structural diversity than quantity-scaled baselines.
  • RL discovers novel strategies: During RL training, unique call-graph structures increase steadily — the model learns new tool-composition patterns beyond SFT demonstrations.
  • Topology shift: DIVE shifts the R/P topology distribution from retrieval-dominated to a balanced retrieval-processing mix, enabling more complex reasoning chains.
  • Long-tail coverage: DIVE activates the full 373-tool pool, maintaining meaningful frequency even for rare domain-specific tools.

BibTeX

@inproceedings{chen2026dive,
  title={{DIVE}: Scaling Diversity in Agentic Task Synthesis
         for Generalizable Tool Use},
  author={},
  booktitle={},
  year={2026}
}