DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use

Aili Chen♠♣, Chi Zhang, Junteng Liu, Jiangjie Chen, Chengyu Du♠♣, Yunji Li, Ming Zhong, Qin Wang, Zhengmao Zhu, Jiayuan Song, Ke Ji, Junxian He, Pengyu Zhao, Yanghua Xiao♠†
Fudan University    MiniMax    Independent
📧 {alchen20, shawyh}@fudan.edu.cn

Abstract

Recent work synthesizes agentic tasks for post-training tool-using LLMs, yet robust generalization under shifts in tasks and toolsets remains an open challenge. We trace this brittleness to insufficient diversity in synthesized tasks.

We propose DIVE, an evidence-driven recipe that inverts synthesis order — executing diverse, real-world tools first and reverse-deriving tasks strictly entailed by the resulting traces. DIVE scales structural diversity along two axes: tool-pool coverage and per-task toolset variety, across 373 tools in five domains.

Training Qwen3-8B on DIVE data (48k SFT + 3.2k RL) improves over the base model by +22 average points across 9 OOD benchmarks and outperforms the strongest 8B baseline by a relative +68%. Diversity scaling consistently outperforms quantity scaling, even with 4× less data.


Figure 1. Top: a fixed toolset limits diversity. Middle: simulated tools risk unverifiable traces. Bottom: DIVE's evidence-first synthesis. Radar: gray = base model; blue = deep-research baseline; purple = DIVE.

About DIVE

Three diverse and decoupled resource pools:

Tool Pool

373 validated tools across 5 domains, each typed as Retrieval or Processing, collected via a Crawl–Validate pipeline with per-tool unit tests.

Seed Pool

~5k entity seeds per domain, drawn from Wikipedia, PubMed, NCBI, and stock exchanges.

Exemplar Pool

Query-only exemplars with structural priors and implicit tool-use patterns.

Evidence Collection → Task Derivation loop:

  1. Config Sampling — sample a seed $S$, a toolset $\mathcal{T}$ (15–50 tools), and exemplars $\mathcal{X}$ (3–5).
  2. Evidence Collection — execute real tools: $E_k = \mathcal{F}_{\text{col}}(Q_{k-1}, E_{k-1} \mid \mathcal{T})$
  3. Task Derivation — reverse-derive a task entailed by the evidence: $(Q_k, A_k) = \mathcal{F}_{\text{der}}(Q_{k-1}, E_k \mid \mathcal{X})$
  4. Iterate $K$ times, deepening the task each round.
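The loop above can be sketched in a few lines. This is a minimal illustration, not DIVE's actual implementation: `collect_fn` and `derive_fn` stand in for the LLM-driven operators $\mathcal{F}_{\text{col}}$ and $\mathcal{F}_{\text{der}}$, and the toy stubs at the bottom are hypothetical placeholders so the loop runs end-to-end.

```python
import random

def synthesize_task(seed, tool_pool, exemplar_pool, collect_fn, derive_fn,
                    k_steps=3, rng=None):
    rng = rng or random.Random(0)
    # 1. Config Sampling: seed S, toolset T (15-50 tools), exemplars X.
    n_tools = rng.randint(15, min(50, len(tool_pool)))
    toolset = rng.sample(tool_pool, n_tools)
    exemplars = rng.sample(exemplar_pool, min(3, len(exemplar_pool)))

    query, evidence = seed, []
    for _ in range(k_steps):
        # 2. Evidence Collection: E_k = F_col(Q_{k-1}, E_{k-1} | T),
        #    obtained by executing real tools from the sampled toolset.
        evidence = evidence + collect_fn(query, evidence, toolset)
        # 3. Task Derivation: (Q_k, A_k) = F_der(Q_{k-1}, E_k | X),
        #    a harder task strictly entailed by the gathered evidence.
        query, answer = derive_fn(query, evidence, exemplars)
    return query, answer, evidence

# Toy stand-ins (hypothetical, for illustration only).
def toy_collect(query, evidence, toolset):
    return [f"{toolset[0]}({query})"]

def toy_derive(query, evidence, exemplars):
    return f"{query}+", f"answer[{len(evidence)}]"

q, a, ev = synthesize_task("seed", [f"tool{i}" for i in range(60)],
                           ["ex1", "ex2", "ex3"], toy_collect, toy_derive)
```

Because tasks are derived *after* tool execution, every answer is grounded in a verifiable trace rather than in a simulated tool's guess.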

Agentic RL

RL on 3.2k frontier tasks with GRPO, using reward $R = \alpha R_{\text{format}} + R_{\text{correct}}$
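A hedged sketch of the reward $R = \alpha R_{\text{format}} + R_{\text{correct}}$: the `<tool_call>` trace format, the exact-match check, and the value of $\alpha$ are illustrative assumptions, not DIVE's actual implementation. The idea is that a small weight rewards well-formed tool calls while correctness dominates.

```python
import re

def reward(rollout_text: str, final_answer: str, gold_answer: str,
           alpha: float = 0.1) -> float:
    # R_format: 1 if the trace contains tool calls and every
    # <tool_call> block is properly closed (hypothetical format).
    opens = len(re.findall(r"<tool_call>", rollout_text))
    closes = len(re.findall(r"</tool_call>", rollout_text))
    r_format = 1.0 if opens == closes and opens > 0 else 0.0
    # R_correct: exact match after light normalization.
    norm = lambda s: s.strip().lower()
    r_correct = 1.0 if norm(final_answer) == norm(gold_answer) else 0.0
    return alpha * r_format + r_correct

r = reward("<tool_call>search(x)</tool_call>", "42", "42")
```

In GRPO, this scalar reward is computed per rollout and advantages are normalized within each group of rollouts for the same task.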


Figure 2. DIVE framework: (1) Resource Preparation; (2) Evidence-Driven Synthesis; (3) Agentic Training.

Evaluation Benchmarks

We evaluate across 10 benchmarks organized into three tiers with progressively increasing distribution shift. We characterize five OOD factors relative to DIVE's training data:

  • Task — shifted task distribution
  • Pool — unseen tool pool
  • Set — unseen toolset
  • Proto — non-OpenAI protocol
  • Env — stateful environment
| Tier | Task Family | Benchmark | OOD Factors | Tool Pool | Toolset | Protocol | Env |
|---|---|---|---|---|---|---|---|
| L1 | In-Distribution | DIVE-Eval | — | 384 Tools (General + Expert) | Per-task | OpenAI | Stateless |
| L2 | General DeepResearch | GAIA, HLE, BC, XB-DS | Task, Set | Search / Browse | Uniform | OpenAI | Stateless |
| L2 | Domain DeepResearch | FinSearchComp (Global) | Task, Set | Search / Browse / Code | Uniform | OpenAI | Stateless |
| L3 | Financial Specialist | FAB | Task, Pool, Set | EDGAR / Web / Parse | Uniform | OpenAI | Stateless |
| L3 | Medical Specialist | MAB | Task, Pool, Set, Proto, Env | FHIR GET / POST | Uniform | HTTP | Stateful |
| L3 | Software Engineering | SWE-bench | Task, Pool, Set, Env | Bash / Search / Editor | Uniform | OpenAI | Stateful |
| L3 | Zero-Shot Generalist | Toolathlon | Task, Pool, Set, Env | 604 Tools (32 MCP Apps) | Per-task | OpenAI | Stateful |

Main Results

We compare DIVE-8B against two categories: 8B baselines trained on other synthesized data (including task-specific specialists like WebExplorer-8B and generalization-oriented models like EnvScaler-8B), and frontier models (≫8B) including Gemini-3-Pro, Claude-4-Sonnet, and DeepSeek-V3.2. All models are evaluated with temperature=1 and top-p=1. Scores are success rates (%).

Overall comparison across L1–L3 benchmarks. Bold: best among 8B models.

| Group | Model | DIVE-Eval | GAIA | HLE | BC | XB-DS | FSC₂ | FSC₃ | FAB | MAB | SWE | Toolathlon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Frontier (≫8B) | Gemini-3-Pro | 45.3 | 80.3 | 42.9 | 49.0 | 76.0 | 70.6 | 52.4 | 39.0 | 74.8 | 76.2 | 36.4 |
| | Claude-4-Sonnet | 44.8 | 63.7 | 20.8 | 12.8 | 62.2 | 60.2 | 33.3 | 39.0 | 79.3 | 72.7 | 29.9 |
| | Gemini-2.5-Pro | 29.1 | 60.2 | 28.4 | 9.9 | 56.0 | 44.5 | 27.4 | 24.0 | 65.1 | 59.6 | 10.5 |
| | DeepSeek-V3.2-Exp | 40.4 | 61.0 | 17.9 | 40.1 | 67.2 | 61.3 | 27.4 | 26.0 | 67.3 | 67.8 | 20.1 |
| | Kimi-K2-0905 | 32.9 | 60.0 | 26.9 | 14.1 | 61.0 | 47.1 | 10.7 | 28.0 | 61.2 | 69.2 | 13.0 |
| | GPT-OSS-120B | 40.5 | 66.0 | 19.0 | 27.0 | 69.5 | 61.0 | 22.0 | 34.0 | 64.3 | 62.0 | 9.8 |
| 8B Baselines | WebExplorer-8B | 19.1 | 50.0 | 17.3 | 15.7 | 53.7 | 35.9 | 18.1 | 4.0 | 17.8 | 7.0 | 0.3 |
| | SWE-Dev-8B | 13.8 | 23.2 | 6.9 | 1.6 | 31.6 | 30.5 | 3.6 | 3.0 | 14.2 | **19.5** | 0.0 |
| | EnvScaler-8B | 15.4 | 25.8 | 2.8 | 1.7 | 45.7 | 40.7 | 10.8 | 14.0 | 56.6 | 11.5 | 2.2 |
| Ours | Qwen3-8B (base) | 13.0 | 22.4 | 6.4 | 1.3 | 24.0 | 28.6 | 7.1 | 2.0 | 38.4 | 10.8 | 0.9 |
| | DIVE-8B (SFT) | 35.4 | 49.3 | 13.8 | 12.9 | 50.2 | 62.1 | 33.0 | 28.0 | 50.2 | 13.2 | 4.7 |
| | DIVE-8B (RL) | **42.5** | **61.2** | **17.8** | **16.4** | **58.1** | **67.3** | **37.3** | **34.0** | **57.3** | 18.3 | **8.3** |
Key Findings
  • Robust OOD generalization: DIVE-8B (RL) improves over the Qwen3-8B base by +22 average points across 9 OOD benchmarks, outperforming the strongest 8B baseline by a relative +68%.
  • Competitive with frontier models: Despite an 8B backbone, DIVE matches or approaches models 10–100× larger on deep-research benchmarks (e.g., GAIA 61.2, FSC₂ 67.3) and challenging specialized tasks (e.g., FAB 34.0, MAB 57.3).
  • Generalist beats specialists: Without task-specific training, DIVE surpasses specialist agents on their home benchmarks (GAIA: 61.2 vs. 50.0 for WebExplorer-8B), while specialists exhibit negative transfer on unseen domains.
  • Zero-shot tool use: On Toolathlon — a stringent benchmark with per-task MCP toolsets and stateful environments — DIVE improves from near-zero (0.9) to 8.3, approaching GPT-OSS-120B (9.8) and Gemini-2.5-Pro (10.5).

Scaling Analysis


Figure 3. (a) Diversity scaling outperforms quantity scaling with 4× less data. (b) Pool expansion yields faster gains. (c) RL amplifies the diversity-scaling trend across 24 paths.

Key Takeaways
  • Diversity > Quantity: Scaling structural diversity yields stronger gains than scaling data volume — even with 4× less data.
  • Two-axis scaling: Pool expansion (more tools) and variety scaling (more toolset combinations) are complementary; combining both maximizes returns.
  • RL amplifies diversity: Reinforcement learning consistently magnifies the advantage of diversity-rich SFT checkpoints across all 24 scaling paths.

Structural Diversity Analysis

We quantify diversity at three levels: tool-pool coverage, toolset variety, and tool-use pattern complexity.

| Metric | Gen-DR | DIVE | Gain |
|---|---|---|---|
| Tools Covered | 2 | 373 | +186× |
| Unique Toolsets | 1 | 46,398 | +46k× |
| Unique Call Seqs | 1,231 | 25,084 | +20× |
| Unique Call Graphs | 19,442 | 39,810 | +105% |
| R/P Topo Classes | 65 | 153 | +135% |
| Tool Types / Task | 1.71 | 3.26 | +91% |
| Avg OOD (SFT) | 22.51 | 32.15 | +43% |
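These structural metrics can be computed directly from logged traces. The sketch below treats a trace as just an ordered list of tool names and approximates a call graph by its set of consecutive-call edges; DIVE's real traces are richer (arguments, R/P types), so this is an illustrative approximation, not the paper's exact procedure.

```python
def diversity_metrics(traces):
    # traces: list of per-task tool-call sequences, e.g. [["search", "parse"], ...]
    tools_covered = {t for trace in traces for t in trace}
    unique_toolsets = {frozenset(trace) for trace in traces}       # order-insensitive
    unique_call_seqs = {tuple(trace) for trace in traces}          # order-sensitive
    # Approximate each call graph by its directed edges between consecutive calls.
    unique_call_graphs = {frozenset(zip(trace, trace[1:])) for trace in traces}
    return {
        "tools_covered": len(tools_covered),
        "unique_toolsets": len(unique_toolsets),
        "unique_call_seqs": len(unique_call_seqs),
        "unique_call_graphs": len(unique_call_graphs),
    }

m = diversity_metrics([["search", "parse"], ["parse", "search"], ["search", "parse"]])
```

Note how the second trace reuses the same toolset as the first but yields a distinct call sequence and call graph, which is exactly the kind of structural variety the table measures.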
Panels: (a) R/P topology density; (b) tool-call frequency.

Figure 5. RL training dynamics: accuracy, tool calls, unique graphs, and R/P topologies over 100 steps.

Key Takeaways
  • Massive diversity gains: DIVE covers 373 tools (vs. 2) and generates 46k unique toolsets — orders-of-magnitude more structural diversity than quantity-scaled baselines.
  • RL discovers novel strategies: During RL training, unique call-graph structures increase steadily — the model learns new tool-composition patterns beyond SFT demonstrations.
  • Topology shift: DIVE shifts the R/P topology distribution from retrieval-dominated to a balanced retrieval-processing mix, enabling more complex reasoning chains.
  • Long-tail coverage: DIVE activates the full 373-tool pool, maintaining meaningful frequency even for rare domain-specific tools.

BibTeX

@misc{chen2026divescalingdiversityagentic,
      title={DIVE: Scaling Diversity in Agentic Task Synthesis for Generalizable Tool Use},
      author={Aili Chen and Chi Zhang and Junteng Liu and Jiangjie Chen and Chengyu Du and Yunji Li and Ming Zhong and Qin Wang and Zhengmao Zhu and Jiayuan Song and Ke Ji and Junxian He and Pengyu Zhao and Yanghua Xiao},
      year={2026},
      eprint={2603.11076},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2603.11076},
}