Decoupling world prediction and action execution into asynchronous temporal streams, with a slow world model and a fast action model, improves both robot control performance and computational efficiency without requiring robot pretraining data.
This paper presents AHA-WAM, a robot control system that runs world prediction and action execution at different rates. A slow video model captures long-term scene patterns, while a fast action model executes short movements by reusing the video model's learned context. This enables responsive closed-loop control without redundant computation.
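The two-rate idea can be illustrated with a minimal, single-threaded sketch. It is not the paper's implementation: `slow_world_model`, `fast_action_model`, `run_control_loop`, and the `world_period` parameter are hypothetical names, the models are toy arithmetic stand-ins, and a fixed refresh period is an assumption made only to show how a cached context lets the fast model avoid recomputing the slow one.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ControlLog:
    """Counts how often each (hypothetical) model is invoked."""
    world_calls: int = 0
    action_calls: int = 0


def slow_world_model(observation: float) -> float:
    # Toy stand-in for the slow video model: maps an observation
    # to a context feature. Assumed, not taken from the paper.
    return 0.5 * observation


def fast_action_model(observation: float, context: float) -> float:
    # Toy stand-in for the fast action model: conditions on the
    # latest observation and the cached context.
    return context - observation


def run_control_loop(
    observations: List[float], world_period: int = 4
) -> Tuple[List[float], ControlLog]:
    """Run the action model every step; refresh context every `world_period` steps."""
    log = ControlLog()
    context = 0.0
    actions: List[float] = []
    for step, obs in enumerate(observations):
        if step % world_period == 0:
            # Slow stream: recompute the context only occasionally.
            context = slow_world_model(obs)
            log.world_calls += 1
        # Fast stream: act at every step, reusing the cached context.
        actions.append(fast_action_model(obs, context))
        log.action_calls += 1
    return actions, log


if __name__ == "__main__":
    acts, log = run_control_loop([float(i) for i in range(8)], world_period=4)
    print(acts, log)
```

With eight observations and `world_period=4`, the toy world model is called twice while the action model is called eight times, which is the computational saving the summary describes. A real system would presumably run the two streams concurrently rather than in one loop.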