Training a model using reward signals derived from the model's own internal representations rather than external labels.