Dense Supervision, Sparse Updates: On the Sparsity and Geometry of On-Policy Distillation

Guo Yu, Wenlin Liu, Yulan Hu, Hao-Xuan Ma, Jun-Peng Jiang et al. | June 11, 2026 | arXiv

Key Takeaway

On-policy distillation produces sparse, structured parameter updates that preserve the geometric properties of on-policy training despite dense supervision, which means efficient subnetworks can be trained in place of full models without losing performance.

Summary

This paper analyzes how on-policy distillation (combining student trajectories with teacher supervision) changes model parameters. The researchers find that parameter updates are sparse and concentrated in specific layers (especially feed-forward networks), yet remain geometrically structured: updates avoid the principal directions of the weight matrices and target coordinates whose weights are near zero.
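The three properties named above (sparsity, avoidance of principal directions, and preference for near-zero coordinates) can each be measured on a single weight matrix. The sketch below is an illustration only, not the paper's measurement procedure: the function names, the tolerance `tol`, the choice of `k`, and the toy matrices are all assumptions made for this example.

```python
import numpy as np


def update_sparsity(w_before, w_after, tol=1e-8):
    """Fraction of coordinates whose change is at most `tol` in magnitude."""
    delta = w_after - w_before
    return float(np.mean(np.abs(delta) <= tol))


def principal_energy_fraction(w_before, w_after, k=1):
    """Share of the update's squared Frobenius norm that lies in the
    top-k singular subspace of the original weights."""
    delta = w_after - w_before
    total = float(np.sum(delta ** 2))
    if total == 0.0:
        return 0.0
    u, _, vt = np.linalg.svd(w_before, full_matrices=False)
    uk, vk = u[:, :k], vt[:k, :].T
    # Project the update onto the span of the top-k left/right singular vectors.
    projected = uk @ (uk.T @ delta @ vk) @ vk.T
    return float(np.sum(projected ** 2)) / total


def mean_abs_weight_at_updates(w_before, w_after, tol=1e-8):
    """Mean |original weight| over the coordinates that actually changed."""
    mask = np.abs(w_after - w_before) > tol
    if not mask.any():
        return 0.0
    return float(np.mean(np.abs(w_before[mask])))


# Toy example (hypothetical data): one dominant direction, a sparse update
# that touches only two small-magnitude coordinates.
w0 = np.diag([10.0, 1.0, 0.1, 0.01])
w1 = w0.copy()
w1[3, 3] += 0.05
w1[2, 3] -= 0.02

sparsity = update_sparsity(w0, w1)
principal = principal_energy_fraction(w0, w1, k=1)
touched = mean_abs_weight_at_updates(w0, w1)
overall = float(np.mean(np.abs(w0)))
print(sparsity, principal, touched, overall)
```

In this toy case the update leaves most coordinates untouched, carries no energy in the top singular direction, and lands on weights far smaller than the matrix average. On a real checkpoint pair, the same functions would be applied per layer to compare, for example, feed-forward and attention matrices.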

training efficiency

Key Terms

on-policy-learning, knowledge-distillation, sparsity, spectral-properties, parameter-efficiency