Comparing model safety when no labeled benchmark exists for the target language, domain, or regulatory context.