A sparse attention method that supports gradient computation, enabling end-to-end training with learned sparsity patterns.