Teaching small models through prompt-based learning, in which they are shown correct and incorrect answers and learn to tell them apart, works better than traditional distillation or standard RL, especially for models under 1B parameters.
This paper introduces ZPPO, a training method that improves small AI models by having them learn from larger teacher models without copying the teachers' exact outputs. Instead of forcing the student to imitate teacher predictions, ZPPO keeps the teacher in the prompt: it builds special question formats that help the student learn to discriminate correct from incorrect answers and to identify its own failure patterns.
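
To make "keeping the teacher in the prompt" concrete, here is a minimal sketch of one way a discrimination-style question could be assembled and scored. It is not ZPPO's actual template or reward, which this summary does not specify. The names `Candidate`, `build_discrimination_prompt`, and `discrimination_reward`, the lettered multiple-choice layout, and the binary reward are all assumptions made for illustration.

```python
import random
import string
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate answer (hypothetical container, not from the paper)."""
    text: str
    is_correct: bool  # assumed to come from checking against a reference answer


def build_discrimination_prompt(question, candidates, seed=0):
    """Return (prompt, correct_label) for a shuffled multiple-choice prompt."""
    if sum(c.is_correct for c in candidates) != 1:
        raise ValueError("expected exactly one correct candidate")
    shuffled = list(candidates)
    # Shuffle with a seeded generator so the correct answer has no fixed slot.
    random.Random(seed).shuffle(shuffled)
    labels = string.ascii_uppercase[: len(shuffled)]
    lines = [f"Question: {question}", "Candidate answers:"]
    correct_label = None
    for label, cand in zip(labels, shuffled):
        lines.append(f"({label}) {cand.text}")
        if cand.is_correct:
            correct_label = label
    lines.append("Which candidate is correct? Reply with a single letter.")
    return "\n".join(lines), correct_label


def discrimination_reward(student_reply, correct_label):
    """Binary reward: 1.0 if the reply starts with the correct label, else 0.0."""
    reply = student_reply.strip().strip("()").upper()
    return 1.0 if reply[:1] == correct_label else 0.0


# Hypothetical example: one teacher answer and one failed student attempt.
candidates = [
    Candidate("12", is_correct=True),   # stands in for a teacher response
    Candidate("7", is_correct=False),   # stands in for a student failure
]
prompt, label = build_discrimination_prompt("What is 3 * 4?", candidates)
print(prompt)
```

In this sketch the teacher's answer appears only as context inside the prompt, and the student is rewarded for picking it out rather than for reproducing it token by token. Which candidates are included and how the reward feeds into the policy update are left open here.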