The ability to identify when a model declines to answer a request, which can indicate the model recognized a harmful or unsafe prompt.