This split contains 800 tasks whose seed concepts are disjoint from those seen in training but which draw on the same 373-tool pool. Each task samples a novel subset of 15–50 tools from that pool. The split evaluates whether the agent can generalize to new seed concepts and new toolset combinations while staying within the training distribution.
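As a rough illustration of how such toolsets could be drawn, the sketch below samples subsets of 15–50 tools from a 373-tool pool and rejects any subset already used. The pool size, subset-size range, and task count come from the text; everything else is an assumption made for illustration, including the uniform choice of subset size, the rejection-sampling loop, the placeholder tool names, the stand-in training set of 100 toolsets, and the hypothetical helper `sample_novel_toolsets`. It is not the authors' actual procedure.

```python
import random

POOL_SIZE = 373                 # size of the shared tool pool (from the text)
MIN_TOOLS, MAX_TOOLS = 15, 50   # per-task toolset size range (from the text)
NUM_TASKS = 800                 # number of tasks in this split (from the text)


def sample_novel_toolsets(pool, seen, num_tasks, rng):
    """Draw `num_tasks` tool subsets, none of which appears in `seen`.

    Hypothetical helper: the subset size is drawn uniformly from
    [MIN_TOOLS, MAX_TOOLS] (an assumption), and duplicates are rejected.
    """
    seen = set(seen)  # copy so the caller's collection is not mutated
    toolsets = []
    while len(toolsets) < num_tasks:
        k = rng.randint(MIN_TOOLS, MAX_TOOLS)    # inclusive on both ends
        subset = frozenset(rng.sample(pool, k))  # k distinct tools
        if subset in seen:
            continue  # already used, by training or by an earlier task
        seen.add(subset)
        toolsets.append(subset)
    return toolsets


# Placeholder tool names; the real tool identifiers are not given in the text.
pool = [f"tool_{i:03d}" for i in range(POOL_SIZE)]
rng = random.Random(0)

# Stand-in for the training toolsets (assumed; the training set size is not stated here).
train_toolsets = set(sample_novel_toolsets(pool, set(), 100, rng))

# Evaluation toolsets: same pool, but no subset shared with training.
eval_toolsets = sample_novel_toolsets(pool, train_toolsets, NUM_TASKS, rng)
```

With subsets of this size drawn from 373 tools, an accidental collision is extremely unlikely, so the rejection step acts mainly as a guarantee rather than a frequent filter.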