Before building bigger models, optimize your data preprocessing: context length, normalization strategy, and regularization can close most of the accuracy gap at a fraction of the computational cost.
This paper shows that simple linear models (Ridge regression) can match or beat complex deep learning architectures for time-series forecasting by carefully tuning preprocessing—context length, normalization, and regularization—rather than scaling model size.