How users phrase queries matters as much as what they ask: benign input variations systematically change AI behavior in ways that matter for real-world deployment, especially in high-stakes domains like healthcare.
This paper shows that small, routine changes in how users phrase questions to AI models can significantly shift their outputs—a problem existing safety testing misses.