AdaCodec: A Predictive Visual Code for Video MLLMs

Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li et al. | June 1, 2026 | arXiv

Key Takeaway

Video MLLMs can be made far more efficient by encoding frames predictively: send a full frame only when needed, and otherwise send a compact description of what changed. This cuts token usage to one-seventh while improving accuracy and reducing latency from more than 9 seconds to 1.6 seconds.

Summary

AdaCodec is a new way to encode video for multimodal LLMs. It decides when to send a full frame and when to send only a compact description of what changed.
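
To make the full-frame versus change-description idea concrete, here is a toy sketch. It is not AdaCodec's actual method, which this summary does not describe. The decision rule (mean absolute pixel difference against the last full frame), the threshold, and the token costs are all hypothetical values chosen for illustration.

```python
from typing import List, Tuple

Frame = List[float]  # a flattened grayscale frame, values in [0, 1]

# Hypothetical token costs and threshold, chosen only for illustration.
FULL_FRAME_TOKENS = 64
CHANGE_TOKENS = 4
CHANGE_THRESHOLD = 0.2


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_encoding(frames: List[Frame]) -> List[Tuple[str, int]]:
    """Label each frame 'full' or 'change' and attach its assumed token cost."""
    plan = []
    reference = None  # the most recent frame that was sent in full
    for frame in frames:
        if reference is None or mean_abs_diff(frame, reference) > CHANGE_THRESHOLD:
            # Too different from the reference (or no reference yet): send it whole.
            plan.append(("full", FULL_FRAME_TOKENS))
            reference = frame
        else:
            # Close enough to the reference: send only a compact change code.
            plan.append(("change", CHANGE_TOKENS))
    return plan


static = [0.1] * 16
shifted = [0.15] * 16    # small change relative to `static`
new_scene = [0.9] * 16   # large change, e.g. a scene cut
frames = [static, shifted, shifted, new_scene, new_scene]

plan = plan_encoding(frames)
total = sum(cost for _, cost in plan)
baseline = FULL_FRAME_TOKENS * len(frames)
print(plan)
print(f"{total} tokens vs. {baseline} if every frame were sent in full")
```

In this toy run only the first frame and the scene cut are sent in full, so the assumed token count falls well below the all-full-frames baseline. The real savings depend on the codec's learned components and on the video content.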

multimodal efficiency architecture

Key Terms

video-language-model visual-tokens token-budget inter-frame-changes prediction-residuals