General AI Assistant benchmark with 103 text-only validation tasks spanning multi-hop reasoning, web search, and information synthesis. Tests broad generalization to unseen task distributions using general-purpose Search/Browse tools.