Instruction-tuned models are surprisingly brittle: applying a trivial lexical constraint, such as banning a single punctuation mark, costs them 14-48% of response quality, while base models remain unaffected. This suggests that instruction tuning ties helpfulness to narrow surface templates and formatting patterns rather than to robust reasoning.
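To make the setup concrete, here is a minimal sketch of one way such a lexical constraint can be imposed at decoding time, using the `bad_words_ids` mechanism in HuggingFace `transformers` to hard-mask every vocabulary entry containing a comma. The model name and prompt are illustrative placeholders, not the evaluation harness behind the numbers above.

```python
# Minimal sketch: ban all comma-containing tokens during generation.
# Assumes a HuggingFace-style causal LM; "gpt2" and the prompt are
# placeholders, not the models or tasks used in the reported results.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Collect ids of every vocabulary entry whose surface form contains ",".
banned = [
    [tid]
    for tok, tid in tokenizer.get_vocab().items()
    if "," in tokenizer.convert_tokens_to_string([tok])
]

inputs = tokenizer("Explain photosynthesis in two sentences.", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=80,
    bad_words_ids=banned,  # hard constraint: banned tokens get -inf logits
    do_sample=False,       # greedy decoding, for reproducibility
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Masking at the logit level guarantees the constraint is satisfied, but it also forces the model off its preferred phrasings at every step, which is precisely the pressure under which instruction-tuned models are reported to degrade while base models do not.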