Models can now learn to reason efficiently during streaming input instead of only after seeing everything, using fine-grained reward signals that separately optimize early thinking and final deliberation phases.
AdaSR enables language models to reason incrementally as data streams in (like audio or video), rather than waiting for complete input. It uses a new training method called Hierarchical Relative Policy Optimization to teach models when to think and how much computation to spend at each stage, balancing accuracy, speed, and efficiency.