On-policy distillation produces sparse, structured parameter updates that preserve the geometric properties of on-policy training despite dense supervision, so efficient subnetworks can be trained in place of full models without losing performance.
This paper analyzes how on-policy distillation, which combines student-generated trajectories with teacher supervision, changes model parameters. The researchers find that the updates are sparse and concentrated in specific layers (especially feed-forward networks), yet remain geometrically structured: they avoid principal weight directions and target near-zero weight coordinates.
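One way to make these three properties concrete is to measure them on a single weight matrix before and after training. The NumPy sketch below is illustrative only and is not the paper's method: the function names, the tolerance `tol`, the rank `k`, the reading of "principal weight directions" as the top singular directions of the pre-update matrix, and the synthetic toy data are all assumptions introduced here.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-8):
    """Fraction of coordinates whose change is at most `tol` in magnitude."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def principal_energy_fraction(w_before, w_after, k=4):
    """Share of the update's squared Frobenius norm that lies in the
    top-k singular subspace of the pre-update weights (assumed definition)."""
    delta = w_after - w_before
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    projected = u[:, :k].T @ delta @ vt[:k, :].T
    total = np.sum(delta ** 2)
    return float(np.sum(projected ** 2) / total) if total > 0 else 0.0


def updated_weight_magnitude_ratio(w_before, w_after, tol=1e-8):
    """Mean |w| over the updated coordinates divided by mean |w| overall.
    Values well below 1 mean the update lands on near-zero weights."""
    changed = np.abs(w_after - w_before) > tol
    if not changed.any():
        return float("nan")
    return float(np.mean(np.abs(w_before[changed])) / np.mean(np.abs(w_before)))


# Synthetic toy data (hypothetical): perturb only the 204 smallest-magnitude
# entries of a random 64x64 matrix, roughly 5% of its coordinates.
rng = np.random.default_rng(0)
w_before = rng.normal(size=(64, 64))
w_after = w_before.copy()
smallest = np.argsort(np.abs(w_before).reshape(-1))[:204]
flat_view = w_after.reshape(-1)  # view into w_after, so edits apply in place
flat_view[smallest] += 0.01 * rng.normal(size=smallest.size)

sparsity = update_sparsity(w_before, w_after)
principal_fraction = principal_energy_fraction(w_before, w_after, k=4)
magnitude_ratio = updated_weight_magnitude_ratio(w_before, w_after)
```

On real checkpoints, `w_before` and `w_after` would be the same layer's weights taken before and after distillation. Comparing the three numbers across attention and feed-forward matrices would be one way to check the layer-concentration claim.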