Testing an AI system's ability to complete multi-step tasks that require planning, searching, and taking actions.