Current AI agents struggle with multi-app tasks (37% success) and with personalization. To act intelligently, they need access to user context and device state, rather than following isolated instructions in sterile environments.
iOSWorld is a benchmark for testing AI agents on real iOS phones with persistent user data. Unlike sandboxed tests, it evaluates whether agents can reason about a user's identity, history, and preferences across 26 interconnected apps populated with realistic data such as messages, transactions, and travel records.