A multi-stage attention approach that first uses a coarse pass to select relevant tokens, then applies fine-grained attention only to the selected subset.
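As a minimal sketch of the two stages, the NumPy example below scores fixed-size blocks of tokens by their mean-pooled key (coarse stage), keeps the highest-scoring blocks, and runs ordinary scaled dot-product softmax attention over only the tokens in those blocks (fine stage). The function name `coarse_to_fine_attention`, the parameters `block_size` and `top_blocks`, and the choice of mean-pooled block keys as the coarse selector are all assumptions made for illustration; the definition above does not prescribe a particular selection rule.

```python
import numpy as np

def coarse_to_fine_attention(q, K, V, block_size=4, top_blocks=2):
    """Hypothetical two-stage attention for a single query vector.

    q: (d,) query; K: (n, d) keys; V: (n, d_v) values.
    Returns the attended output and the indices of the tokens used.
    """
    n, d = K.shape
    if n % block_size != 0:
        raise ValueError("this sketch assumes n is divisible by block_size")
    n_blocks = n // block_size

    # Coarse stage (assumed selector): one mean-pooled key per block,
    # scored against the query with a plain dot product.
    block_keys = K.reshape(n_blocks, block_size, d).mean(axis=1)
    block_scores = block_keys @ q
    keep = np.sort(np.argsort(block_scores)[-top_blocks:])

    # Expand the kept block ids into the token indices they cover.
    idx = (keep[:, None] * block_size + np.arange(block_size)[None, :]).ravel()

    # Fine stage: scaled dot-product softmax attention over the subset only.
    scores = K[idx] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ V[idx], idx

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))
V = rng.normal(size=(16, 8))
q = rng.normal(size=8)
out, idx = coarse_to_fine_attention(q, K, V, block_size=4, top_blocks=2)
print(out.shape, idx.shape)
```

In this sketch the fine stage touches `top_blocks * block_size` tokens instead of all `n`, which is where the savings would come from. Setting `top_blocks` to the total number of blocks recovers full attention over every token.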