Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma et al. | June 17, 2026 | arXiv

Key Takeaway

Active perception, in which an AI agent decides what to observe instead of passively processing every input, enables better video understanding at lower computational cost, and its results improve as the agent is allowed more reasoning steps.

Summary

OmniAgent is an AI agent that understands videos by deciding which parts to attend to, rather than processing every frame. It combines audio, visual, and text understanding in a reasoning loop and builds up a memory of important moments as it goes. This approach scales better with video length and achieves top performance on video understanding benchmarks.
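To make the idea of a perceive-and-remember loop concrete, here is a minimal Python sketch. It is not OmniAgent's implementation: the coarse-to-fine search strategy, the function `active_perception`, the `Observation` record, and the stand-in callbacks `perceive` and `is_relevant` are all assumptions made for illustration. In a real system, a multimodal model would presumably fill both roles.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Observation:
    start: float  # window start, in seconds
    end: float    # window end, in seconds
    note: str     # what the (hypothetical) perception call reported


def active_perception(
    perceive: Callable[[float, float], str],
    is_relevant: Callable[[str], bool],
    duration: float,
    coarse_window: float,
    min_window: float,
    max_steps: int,
) -> List[Observation]:
    """Illustrative loop: look at coarse windows, zoom in only where relevant."""
    # Start with a coarse pass over the whole video.
    queue = deque()
    t = 0.0
    while t < duration:
        queue.append((t, min(t + coarse_window, duration)))
        t += coarse_window

    memory: List[Observation] = []
    steps = 0
    while queue and steps < max_steps:
        start, end = queue.popleft()
        note = perceive(start, end)  # one perception call costs one step
        steps += 1
        if not is_relevant(note):
            continue  # nothing of interest: never look closer here
        memory.append(Observation(start, end, note))
        mid = (start + end) / 2
        if mid - start >= min_window:
            # Zoom in: examine both halves before any remaining coarse windows.
            queue.appendleft((mid, end))
            queue.appendleft((start, mid))
    return memory


def toy_perceive(start: float, end: float) -> str:
    # Toy stand-in: pretend the only interesting moment is at t = 42 s.
    return "event" if start <= 42.0 < end else "nothing"
```

With a 120-second toy video, 30-second coarse windows, and a 10-second minimum window, a budget of two steps leaves only the coarse window from 30 to 60 seconds in memory, while a budget of eight steps also records the narrower window from 30 to 45 seconds. That mirrors, in miniature, the claim that a larger reasoning budget yields a sharper result while irrelevant stretches of the video are never examined in detail.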

agents multimodal reasoning

Key Terms

pomdp agentic-reinforcement-learning active-perception test-time-scaling omni-modal