SpecEyes: Accelerating Agentic Multimodal LLMs via Speculative Perception and Planning

Haoyu Huang, Jinfa Huang, Zhongwei Wan, Xiawu Zheng, Rongrong Ji et al. | March 24, 2026 | arXiv

Key Takeaway

A smaller speculative model can predict an agentic system's tool-calling trajectory, enabling parallel execution and early termination of expensive operations. The reported result is a 1.1x to 3.35x speedup without accuracy loss.

Summary

SpecEyes speeds up agentic multimodal AI systems by using a lightweight model to predict which tools the main model will need, so that expensive operations can be skipped or run in parallel. This yields a 1.1x to 3.35x speedup while maintaining accuracy, addressing a key latency bottleneck in systems like OpenAI o3 that repeatedly invoke vision tools.
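
The summary describes the mechanism only in outline, so the following is a minimal Python sketch of the general idea rather than the paper's implementation. It assumes a lightweight model that guesses the main model's tool calls, starts those calls in parallel, and reuses a prefetched result whenever the main model's actual call matches a guess. Every name here (`run_tool`, `draft_predict`, `main_model_calls`, `speculative_run`) and the example tools are hypothetical stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_tool(call):
    """Hypothetical stand-in for an expensive vision tool."""
    name, arg = call
    return f"{name}({arg})"


def draft_predict(query):
    """Hypothetical lightweight model: guesses the main model's tool calls."""
    return [("zoom", "region_a"), ("ocr", "region_a")]


def main_model_calls(query):
    """Hypothetical main model: the tool calls it actually requests, in order."""
    return [("zoom", "region_a"), ("ocr", "region_b")]


def speculative_run(query):
    predicted = draft_predict(query)
    results, hits = [], 0
    with ThreadPoolExecutor() as pool:
        # Launch the predicted tool calls before the main model asks for them.
        futures = {call: pool.submit(run_tool, call) for call in predicted}
        for call in main_model_calls(query):
            if call in futures:
                # Correct guess: reuse the prefetched result.
                results.append(futures[call].result())
                hits += 1
            else:
                # Wrong guess: fall back to running the tool on demand.
                results.append(run_tool(call))
    return results, hits


results, hits = speculative_run("What does the sign say?")
print(results, hits)
```

In this sketch a wrong guess costs only wasted speculative work, since the main model's own calls always decide the final results. That is one way to read the "maintaining accuracy" claim, though the paper's actual gating and verification scheme may differ.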

efficiency multimodal agents

Key Terms

speculative-decoding agentic-depth cognitive-gating multimodal-agent tool-invocation