Instruction-tuned models are surprisingly brittle: applying a trivial lexical constraint, such as banning a single punctuation mark, costs them 14-48% of response quality, while base models remain unaffected. This suggests that instruction tuning ties helpfulness to narrow surface templates and formatting patterns rather than to robust reasoning.
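To make the setup concrete, here is a minimal sketch of one way such a lexical constraint can be imposed at decoding time, using the `bad_words_ids` mechanism in HuggingFace `transformers` to hard-mask every vocabulary entry containing a comma. The model name and prompt are illustrative placeholders, not the evaluation harness behind the numbers above.

```python
# Minimal sketch: ban all comma-containing tokens during generation.
# Assumes a HuggingFace-style causal LM; "gpt2" and the prompt are
# placeholders, not the models or tasks used in the reported results.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Collect ids of every vocabulary entry whose surface form contains ",".
banned = [
    [tid]
    for tok, tid in tokenizer.get_vocab().items()
    if "," in tokenizer.convert_tokens_to_string([tok])
]

inputs = tokenizer("Explain photosynthesis in two sentences.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=80,
    bad_words_ids=banned,  # hard constraint: banned tokens get -inf logits
    do_sample=False,       # greedy decoding, for reproducibility
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Masking at the logit level guarantees the constraint is satisfied, but it also forces the model off its preferred phrasings at every step, which is precisely the pressure under which instruction-tuned models are reported to degrade while base models do not.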