LensWalk is an AI framework that lets language models actively control how they watch videos while reasoning about them. Giving the agent control over its visual perception, deciding what to look at and when, significantly improves video reasoning accuracy, and this active observation approach works as a plug-and-play upgrade for existing vision-language models.
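The summary doesn't spell out LensWalk's interface, so the sketch below is only a guess at what an active-observation loop might look like in practice: instead of receiving a fixed set of uniformly sampled frames up front, the model starts from a coarse glance at the video and then repeatedly decides which time window to inspect next, until it is confident enough to answer. Every name here (`Observation`, `AgentState`, `vlm_step`, `extract_frames`) is a hypothetical placeholder, not LensWalk's actual API.

```python
# Illustrative sketch of an active-observation loop (NOT LensWalk's real API).
# The model alternates between reasoning and choosing what to look at next.

from dataclasses import dataclass, field


@dataclass
class Observation:
    """A window of frames the agent chose to inspect (hypothetical)."""
    start_s: float      # window start, in seconds
    end_s: float        # window end, in seconds
    num_frames: int     # how many frames to sample from the window


@dataclass
class AgentState:
    """The question plus the reasoning accumulated so far (hypothetical)."""
    question: str
    notes: list[str] = field(default_factory=list)


def vlm_step(state: AgentState, frames) -> tuple[str, Observation | None]:
    """Hypothetical call into a vision-language model.

    Returns (reasoning_text, next_observation). A None observation means
    the model believes it has seen enough and the loop should stop.
    """
    raise NotImplementedError("plug in your VLM here")


def extract_frames(video_path: str, obs: Observation):
    """Hypothetical frame sampler, e.g. built on OpenCV or decord."""
    raise NotImplementedError("plug in your video decoder here")


def active_observation_loop(video_path: str, question: str,
                            max_steps: int = 8) -> str:
    """Let the model decide what to watch and when, step by step."""
    state = AgentState(question=question)
    # Begin with a coarse glance at the opening of the video.
    obs: Observation | None = Observation(start_s=0.0, end_s=60.0, num_frames=8)
    for _ in range(max_steps):
        frames = extract_frames(video_path, obs)
        thought, obs = vlm_step(state, frames)
        state.notes.append(thought)
        if obs is None:        # the model decided it has seen enough
            break
    return state.notes[-1]     # final reasoning step doubles as the answer
```

Because a loop like this only needs a model that can emit a "look here next" action alongside its reasoning, a wrapper of this shape is one plausible reading of the plug-and-play claim: the underlying vision-language model stays unchanged, and the observation controller simply sits on top of it.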