iOSWorld: A Benchmark for Personally Intelligent Phone Agents

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh et al. | June 8, 2026 | arXiv

Key Takeaway

Current AI agents struggle with multi-app tasks (37% success) and with personalization. To act intelligently, they need access to user context and device state rather than isolated instructions in sterile environments.

Summary

iOSWorld is a benchmark for testing AI agents on real iOS phones with persistent user data. Unlike sandboxed tests, it evaluates whether agents can reason about a user's identity, history, and preferences across 26 interconnected apps with realistic data like messages, transactions, and travel records.

evaluation, agents, applications

Key Terms

computer-use, agentic-tasks, personalization, multi-app-coordination, benchmark