When evaluating continual learning systems on streaming data, the way you partition the stream into tasks is as important as the algorithm itself: different valid splits of the same data can produce contradictory conclusions about which method works best.

This paper shows that the choice of task boundaries dramatically affects continual learning benchmarks, even when the data and model are held fixed. The authors introduce tools to quantify this effect and demonstrate that moving the boundaries alone can flip which learning method performs best, making temporal taskification a critical but often-overlooked evaluation choice.
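To make the idea concrete, here is a minimal sketch of two equally valid "taskifications" of the same labeled stream: one cuts the stream into contiguous time chunks, the other groups samples by class label. The function names, the toy stream, and the split rules are illustrative assumptions, not the paper's actual procedure; the point is only that the same data yields different task sequences.

```python
# Hypothetical sketch: two valid taskifications of one labeled stream.
# Split rules and names are illustrative, not taken from the paper.

def split_by_time(stream, n_tasks):
    """Partition the stream into contiguous, equal-length chunks."""
    size = len(stream) // n_tasks
    return [stream[i * size:(i + 1) * size] for i in range(n_tasks)]

def split_by_label(stream, n_tasks):
    """Group samples by class label, one group of labels per task."""
    labels = sorted({y for _, y in stream})
    per_task = len(labels) // n_tasks
    groups = [set(labels[i * per_task:(i + 1) * per_task])
              for i in range(n_tasks)]
    return [[(x, y) for x, y in stream if y in g] for g in groups]

# Toy stream of (sample_id, class_label) pairs whose label mix drifts.
stream = [(i, i // 4 % 4) for i in range(32)]

time_tasks = split_by_time(stream, n_tasks=4)
label_tasks = split_by_label(stream, n_tasks=4)

# Same data, different task sequences: the label set per task differs.
print([sorted({y for _, y in t}) for t in time_tasks])   # → [[0, 1], [2, 3], [0, 1], [2, 3]]
print([sorted({y for _, y in t}) for t in label_tasks])  # → [[0], [1], [2], [3]]
```

A method that handles the second sequence (sharp class boundaries) well may look poor on the first (recurring classes), which is exactly the evaluation sensitivity the paper measures.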