Test-time compute scaling helps embodied AI, but its scaling strategies (reasoning depth, model size, memory) differ in cost and benefit. Routing among them based on scene context can match stronger models at 65% lower latency.
This paper introduces DIRECT, a routing framework that allocates test-time compute for vision-language models used as robot planners. Scaling compute uniformly raises latency and cost on every query; DIRECT instead analyzes scene context to decide when to use chain-of-thought reasoning, a larger model, or extended memory, achieving better performance per dollar on real robots.
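
To make the routing idea concrete, here is a minimal sketch of a scene-conditioned router that chooses among the three scaling strategies named above. It is an illustration only, not DIRECT's actual method: the `SceneContext` fields, the `route` function, and all thresholds are hypothetical, and hand-written rules stand in for whatever decision mechanism the paper uses.

```python
from dataclasses import dataclass
from enum import Enum


class Strategy(Enum):
    BASE = "base"                 # default planner, direct answer
    CHAIN_OF_THOUGHT = "cot"      # add step-by-step reasoning
    LARGER_MODEL = "large"        # escalate to a bigger model
    EXTENDED_MEMORY = "memory"    # include a longer observation history


@dataclass(frozen=True)
class SceneContext:
    # Hypothetical features; the source does not list the router's inputs.
    num_objects: int      # how cluttered the scene is
    task_steps: int       # rough number of sub-steps in the instruction
    needs_history: bool   # whether the task refers to earlier observations


def route(scene: SceneContext) -> Strategy:
    """Pick a strategy from scene features using toy, assumed thresholds."""
    if scene.needs_history:
        return Strategy.EXTENDED_MEMORY
    if scene.task_steps > 3:
        return Strategy.CHAIN_OF_THOUGHT
    if scene.num_objects > 10:
        return Strategy.LARGER_MODEL
    # Simple scenes keep the cheapest path, which is where latency is saved.
    return Strategy.BASE
```

For example, `route(SceneContext(num_objects=2, task_steps=1, needs_history=False))` returns `Strategy.BASE`, so only scenes that trip one of the rules pay for extra compute.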