Testing reward hypotheses by branching from shared policy checkpoints, training each branch under a different candidate reward, and comparing short-horizon performance across branches to assess reward quality.
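
A minimal sketch of this kind of branching comparison is given below, under loudly stated assumptions: the policy is a toy softmax over three bandit arms, each candidate reward is a per-arm reward vector, and "short-horizon performance" is the expected value of a separate evaluation score after a few exact policy-gradient steps. The names `train_branch`, `compare_rewards`, `eval_score`, and the example reward vectors are hypothetical and chosen only for illustration; they are not taken from the source.

```python
import copy
import math


def softmax(logits):
    """Convert logits to action probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(logits, score):
    """Expected per-step score of the softmax policy under a per-arm score vector."""
    return sum(p * s for p, s in zip(softmax(logits), score))


def train_branch(checkpoint, reward, steps=20, lr=0.5):
    """Branch from the shared checkpoint and train briefly on one candidate reward.

    Uses the exact expected policy gradient for a softmax bandit:
    d/d logit_i of E[r] = p_i * (r_i - E[r]).
    """
    logits = copy.deepcopy(checkpoint)  # the shared checkpoint is never mutated
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * r for p, r in zip(probs, reward))
        logits = [l + lr * p * (r - baseline)
                  for l, p, r in zip(logits, probs, reward)]
    return logits


def compare_rewards(checkpoint, candidates, eval_score, steps=20):
    """Train one short branch per candidate reward and rank them by evaluation score."""
    results = {}
    for name, reward in candidates.items():
        branch = train_branch(checkpoint, reward, steps=steps)
        results[name] = expected_score(branch, eval_score)
    # Highest short-horizon evaluation score first.
    return dict(sorted(results.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical setup: one shared checkpoint, two candidate rewards.
checkpoint = [0.0, 0.0, 0.0]          # uniform policy over three arms
eval_score = [0.0, 0.5, 1.0]          # held-out measure of task performance
candidates = {
    "aligned": [0.0, 0.4, 1.0],       # roughly tracks the evaluation score
    "misaligned": [1.0, 0.2, 0.0],    # rewards the wrong arm
}
ranking = compare_rewards(checkpoint, candidates, eval_score)
for name, score in ranking.items():
    print(f"{name}: {score:.3f}")
```

Because every branch starts from the same checkpoint and runs for the same number of steps, differences in the evaluation score can be attributed to the reward alone rather than to initialization or training budget. In a realistic setting the exact gradient would be replaced by sampled rollouts, so several seeds per branch would likely be needed before drawing conclusions.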