SMT decouples learning what to remember from learning how to update memory, which lets RNNs train in parallel with stable gradients. This could make RNNs competitive with Transformers on long-sequence tasks without sequential computation during training.
This paper proposes Supervised Memory Training (SMT), a new way to train recurrent neural networks that avoids the sequential bottleneck of standard backpropagation through time. Instead of unrolling the RNN over long sequences, SMT first trains a Transformer to learn what information to remember, then uses the resulting memory labels as supervised targets to train the RNN in parallel.
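
To make the parallel step concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's implementation: this summary does not specify the cell, the loss, or how the labels are built. The sketch assumes a label is available for the memory at every time step, a simple tanh cell, and a mean-squared-error loss. A hidden "teacher" cell stands in for the trained Transformer as the source of labels. All names (`cell`, `rollout`, `parallel_loss_and_grads`) are hypothetical.

```python
import numpy as np

def cell(params, prev_mem, x):
    """One memory update, m_t = tanh(m_{t-1} W + x_t U + b).
    Accepts a single step or a whole batch of steps."""
    W, U, b = params
    return np.tanh(prev_mem @ W + x @ U + b)

def init_params(rng, d_in, d_mem, scale):
    return [rng.normal(scale=scale, size=(d_mem, d_mem)),
            rng.normal(scale=scale, size=(d_in, d_mem)),
            np.zeros(d_mem)]

def rollout(params, x, m0):
    """Sequential reference: each step depends on the previous one."""
    mems = [m0]
    for x_t in x:
        mems.append(cell(params, mems[-1], x_t))
    return np.stack(mems)  # shape (T + 1, d_mem)

def parallel_loss_and_grads(params, x, labels):
    """Loss over all T steps at once, with no loop over time.
    Step t is fed the label for step t - 1 instead of the cell's own output."""
    prev, target = labels[:-1], labels[1:]
    pred = cell(params, prev, x)              # one batched update for all steps
    err = pred - target
    loss = np.mean(err ** 2)
    d_pre = (2.0 / err.size) * err * (1.0 - pred ** 2)  # backprop through tanh
    return loss, [prev.T @ d_pre, x.T @ d_pre, d_pre.sum(axis=0)]

rng = np.random.default_rng(0)
T, d_in, d_mem = 64, 4, 8                     # hypothetical sizes
x = rng.normal(size=(T, d_in))

# ASSUMPTION: stand-in for the Transformer's memory labels.
teacher = init_params(rng, d_in, d_mem, scale=0.5)
labels = rollout(teacher, x, np.zeros(d_mem))

# Train the student cell with plain gradient descent on the parallel loss.
student = init_params(rng, d_in, d_mem, scale=0.1)
initial_loss, _ = parallel_loss_and_grads(student, x, labels)
for _ in range(500):
    _, grads = parallel_loss_and_grads(student, x, labels)
    student = [p - 0.2 * g for p, g in zip(student, grads)]
final_loss, _ = parallel_loss_and_grads(student, x, labels)
teacher_loss, _ = parallel_loss_and_grads(teacher, x, labels)
```

In this sketch, every step's input memory is a fixed label rather than the cell's previous output, so no gradient flows through a chain of time steps and all `T` updates reduce to one batched matrix product. Whether SMT uses exactly this per-step formulation is an assumption of the sketch.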